About the job
The Scaling team at OpenAI builds and maintains the core infrastructure that supports research efforts. This group focuses on enabling rapid progress toward Artificial General Intelligence by providing the systems and tools researchers rely on every day. Its work covers everything from foundational infrastructure to specialized applications, all designed to handle increasing complexity and scale without sacrificing reliability or ease of use.
Role overview
OpenAI is seeking a Site Reliability Engineer to manage and improve the infrastructure behind its analytics platform. This position centers on supporting production systems that handle data-intensive, low-latency workloads. Key technologies include large-scale ClickHouse clusters, high-throughput Kafka pipelines, and stable integrations with Snowflake. The engineer in this role will turn ambiguous operational challenges into concrete solutions, deliver improvements quickly, and iterate based on real-world feedback.
Success in this role means independently setting and raising operational standards, working closely with production systems, and collaborating across teams to ensure reliability at scale.
Key responsibilities
- Manage the full lifecycle of infrastructure: provisioning, upgrades, scaling, and decommissioning using Infrastructure as Code (IaC).
- Operate and scale ClickHouse clusters, including sharding, replication, capacity planning, tuning, and maintenance.
- Run Kafka as the primary data ingestion layer, improving throughput, managing lag and backpressure, and ensuring robust failure recovery.
- Improve latency and reliability for data-heavy serving and query workloads.
- Develop and maintain monitoring and alerting systems, including SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
- Create and refine incident response protocols, on-call procedures, and postmortem practices.
- Oversee backup, restore, and disaster recovery strategies, including regular drills.
- Plan and execute safe rollouts across development, staging, and production environments, using canary deployments and rollback plans.
- Work daily with software engineers to embed reliability into system design, implementation, and release cycles.
- Set and promote standards for operational readiness and runbooks, encouraging adoption across teams.
- Enhance CI/CD pipelines and improve the developer experience for greater speed and safety.

