Senior Platform & Reliability Engineer at Vizcom | San Francisco

Vizcom | San Francisco
On-site | Full-time | $200K/yr - $250K/yr

Experience Level

Senior

Qualifications

To thrive in this role, you should have a strong background in platform reliability engineering with a focus on incident management, Kubernetes, and API architecture. You should also possess excellent communication skills and the ability to work collaboratively across teams.

About the job

Agency Notice: We are not currently collaborating with recruiting agencies for this role. We kindly ask that you refrain from contacting Vizcom employees regarding this position. Any resumes submitted without prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a cutting-edge visual creation platform that merges advanced web tooling with AI-driven workflows. Our technology stack incorporates React/TypeScript for the front end, Node/Koa + PostGraphile for API services, PostgreSQL, Redis, BullMQ for queuing, and a Kubernetes-based production infrastructure.

We are seeking a seasoned expert to oversee platform stability and infrastructure, ensuring our system remains reliable, efficient, and resilient as we scale.

Role Mission

Take full ownership of service reliability: proactively prevent incidents, minimize impact during failures, and guide swift, high-quality recovery during production downtimes.

This role involves hands-on technical leadership, granting you the authority to establish reliability standards and enforce production protocols.

Compensation

Base salary between $200,000 and $250,000, plus significant equity.


Your Responsibilities

  • Reliability Standards: Define and uphold SLIs/SLOs/error budgets for key user interactions.

  • Resilience of Production Architecture: Implement failure isolation across APIs, workers, queues, and interdependencies to ensure one subsystem's failure does not disrupt core access.

  • Kubernetes Runtime Reliability: Establish probe contracts, deployment standards, graceful shutdown protocols, scaling/resource policies, and startup safety measures.

  • Queue & Job Safety (BullMQ/Redis): Manage poison pill containment and workload segregation.

  • Incident Command Quality: Lead Sev1/Sev2 incident responses from containment to corrective actions.

  • Reliability Operating System: Oversee observability quality (prioritizing signal over noise), on-call efficiency, runbook maintenance, and postmortem discipline.

  • Deployment Safety Authority: Gate risky deployments and enforce reliability protocols whenever production health is compromised.
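
As an illustration of the Reliability Standards item above, an error budget follows directly from an SLO target: a 99.9% success target over 1,000,000 requests tolerates roughly 1,000 failures. The TypeScript sketch below shows that arithmetic under stated assumptions. The `errorBudget` function, the `ErrorBudget` shape, and the example numbers are hypothetical and are not taken from Vizcom's systems.

```typescript
// Hypothetical helper: all names and numbers here are illustrative assumptions.
interface ErrorBudget {
  allowedFailures: number;   // failures the SLO tolerates in this window
  remainingFailures: number; // allowed minus observed (negative when overspent)
  consumedFraction: number;  // 1.0 means the budget is fully spent
}

function errorBudget(
  sloTarget: number,      // e.g. 0.999 for a 99.9% success SLO
  totalRequests: number,  // requests observed in the SLO window
  failedRequests: number  // requests that violated the SLI in the window
): ErrorBudget {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalRequests < 0 || failedRequests < 0 || failedRequests > totalRequests) {
    throw new RangeError("request counts are inconsistent");
  }

  // The budget is the share of requests that is allowed to fail.
  const allowedFailures = (1 - sloTarget) * totalRequests;

  // With no traffic there is no budget to spend, so report zero consumption.
  const consumedFraction =
    allowedFailures === 0 ? 0 : failedRequests / allowedFailures;

  return {
    allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
    consumedFraction,
  };
}

// Example: 250 failures against a 99.9% target over 1,000,000 requests.
const budget = errorBudget(0.999, 1000000, 250);
console.log(budget);
```

A team might, for instance, pause risky deployments once `consumedFraction` reaches 1. That policy is an assumption made for this sketch, not something the posting specifies.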

About Vizcom

At Vizcom, we are revolutionizing visual creation through our sophisticated platform that leverages the latest advancements in AI. Our collaborative environment encourages innovation and creativity, making us a leader in the industry.
