About the job
At Serve Robotics, we are transforming urban mobility with our innovative sidewalk robot, which embodies our vision for a future where deliveries are efficient and accessible. Our robots are designed to navigate congested streets, making deliveries available to a broader audience and supporting local businesses.
The Serve fleet has been successfully delivering joy to merchants, customers, and pedestrians across major cities such as Los Angeles, Miami, Dallas, Atlanta, and Chicago. We are looking for passionate individuals who can help evolve robotic deliveries from a fascinating novelty to a seamless, everyday occurrence.
About Us
We are a team of seasoned professionals from the tech industry, specializing in software, hardware, and design, united in our mission to create the future we envision. Our focus is on addressing real-world challenges using robotics, machine learning, and computer vision, all while ensuring an exceptional user experience. Our diverse and agile team thrives on collaboration and respect, believing that complex problems are best solved together.
Role Overview
As the Senior Reliability Operations Engineer, you will play a pivotal role in improving reliability across our regional operations. You will oversee incident response, manage escalations, and provide Tier 2 support for both robotic and cloud systems. This role involves developing and refining runbooks, automations, and operational processes in close collaboration with product engineering and Site Reliability Engineers (SREs). You will act as the regional incident lead, ensuring timely resolution of issues and clear communication with all stakeholders.
Your Responsibilities
Act as the primary incident lead during your region's operational hours, coordinating technical investigations, centralizing communication, and engaging relevant engineering and SRE teams for escalations.
Address escalations from Tier 1 support, utilizing runbooks, metrics, logs, and system diagnostics to troubleshoot and resolve issues or escalate to Tier 3 as necessary.
Create and maintain runbooks, workflows, and operational documentation to ensure consistent responses to recurring issues, collaborating with product teams to enhance coverage over time.
Develop, maintain, and enhance automation scripts and tools to streamline common remediation processes, improving response times and minimizing manual operational tasks.
Use metrics, logs, and tracing tools (such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry) to proactively identify issues, validate system behavior, and continuously improve detection methods.

