About the job
About Us
At Sierra, we are building a platform that helps businesses create authentic customer experiences with AI. Headquartered in San Francisco, we also have a presence in Atlanta, New York, London, France, Singapore, and Japan.
Our culture is shaped by five core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These values guide how we work and are integral to our mission.
Our founders, Bret Taylor and Clay Bavor, bring deep industry experience. Bret, currently Board Chair of OpenAI, was previously co-CEO of Salesforce and CTO of Facebook. Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.
Your Role
As a Software Engineer on the Site Reliability team, you will play a central role in establishing and improving the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Working closely with our engineering and product teams, you will ensure our systems remain highly available, efficient, and ready for growth.
Lead the development of Sierra’s observability stack—including monitoring, alerting, logging, and tracing—to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, using Terraform and modern DevOps tooling.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.
Qualifications
Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques—monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.

