About the job
Agency Notice: We are not currently collaborating with recruiting agencies for this role. We kindly ask that you refrain from contacting Vizcom employees regarding this position. Any resumes submitted without prior agreement will be considered unsolicited.
About Vizcom
Vizcom is a cutting-edge visual creation platform that merges advanced web tooling with AI-driven workflows. Our stack comprises React/TypeScript on the front end, Node/Koa with PostGraphile for API services, PostgreSQL, Redis, BullMQ for queuing, and Kubernetes-based production infrastructure.
We are seeking a seasoned expert to oversee platform stability and infrastructure, ensuring our system remains reliable, efficient, and resilient as we scale.
Role Mission
Take full ownership of service reliability: proactively prevent incidents, minimize impact during failures, and guide swift, high-quality recovery during production outages.
This role involves hands-on technical leadership, granting you the authority to establish reliability standards and enforce production protocols.
Compensation
Base salary between $200,000 and $250,000, plus significant equity.
Your Responsibilities
Reliability Standards: Define and uphold SLIs/SLOs/error budgets for key user interactions.
Resilience of Production Architecture: Implement failure isolation across APIs, workers, queues, and their dependencies so that the failure of one subsystem does not disrupt core access.
Kubernetes Runtime Reliability: Establish probe contracts, deployment standards, graceful shutdown protocols, scaling/resource policies, and startup safety measures.
Queue & Job Safety (BullMQ/Redis): Manage poison-pill containment and workload segregation.
Incident Command Quality: Lead Sev1/Sev2 incident responses from containment to corrective actions.
Reliability Operating System: Oversee observability quality (prioritizing signal over noise), on-call efficiency, runbook maintenance, and postmortem discipline.
Deployment Safety Authority: Gate risky deployments and enforce reliability protocols whenever production health is compromised.

