About the job
About CodeRabbit
CodeRabbit is a research and development firm dedicated to building highly efficient human-machine collaboration systems. Our mission is to develop the next generation of AI-driven code review tools: a partnership between human creativity and advanced algorithms that achieves more than individual engineers can on their own. By combining language models with human innovation, we aim to raise the standards of efficiency and quality in software development.
The Role
We are looking for a talented Site Reliability Engineer (SRE) to join our Platform Engineering team in the Bay Area. In this role, you will play a key part in maintaining the high availability, performance, and scalability of CodeRabbit's AI-enhanced code review platform. The position sits at the intersection of software engineering and systems operations: you will build the foundational platforms and automation that let our engineering teams deploy, monitor, and scale our services reliably.
As a Site Reliability Engineer at CodeRabbit, you will improve the reliability of the essential services that handle millions of code reviews, develop automation platforms, and manage the infrastructure that drives our AI analysis engine. You will work with cutting-edge technologies such as large language models, real-time processing systems, and distributed architectures that operate at scale.
Key Responsibilities
Infrastructure & Platform Ownership
Design, implement, and maintain scalable infrastructure on Google Cloud Platform to accommodate CodeRabbit's expanding user base and processing needs.
Take ownership of and operate essential platform services.
Develop and manage Infrastructure as Code using Terraform to guarantee consistent, reproducible, and version-controlled infrastructure deployments.
Reliability & Performance Engineering
Establish and uphold SLI/SLO frameworks for all critical services, ensuring we fulfill our reliability commitments to users.
Implement comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation.
Conduct thorough incident response, root cause analysis, and post-mortems to continually improve system reliability.
Optimize application and infrastructure performance to handle millions of pull request analyses with minimal latency.

