Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI, San Francisco
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

We are looking for candidates who possess a strong background in site reliability engineering or a related field, with significant experience in managing data-heavy applications. Familiarity with ClickHouse and Kafka is essential, as is a solid understanding of cloud infrastructure and automation tools. Ideal candidates will have:

  • Proven expertise in managing large-scale distributed systems.
  • Experience with Infrastructure as Code (IaC) practices.
  • Strong problem-solving skills and the ability to work independently.
  • Excellent communication skills, both verbal and written.
  • A passion for optimizing performance and reliability in production environments.

About the job

The Scaling team at OpenAI builds and maintains the core infrastructure that supports research efforts. This group focuses on enabling rapid progress toward Artificial General Intelligence by providing the systems and tools researchers rely on every day. Their work covers everything from foundational infrastructure to specialized applications, all designed to handle increasing complexity and scale without sacrificing reliability or ease of use.

Role overview

OpenAI is seeking a Site Reliability Engineer to manage and improve the infrastructure behind its analytics platform. This position centers on supporting production systems that handle data-intensive, low-latency workloads. Key technologies include large-scale ClickHouse clusters, high-throughput Kafka pipelines, and stable integrations with Snowflake. The engineer in this role will turn ambiguous operational challenges into concrete solutions, deliver improvements quickly, and iterate based on real-world feedback.

Success in this role means independently setting and raising operational standards, working closely with production systems, and collaborating across teams to ensure reliability at scale.

Key responsibilities

  • Manage the full lifecycle of infrastructure: provisioning, upgrades, scaling, and decommissioning using Infrastructure as Code (IaC).
  • Operate and scale ClickHouse clusters, including sharding, replication, capacity planning, tuning, and maintenance.
  • Run Kafka as the primary data ingestion layer, improving throughput, managing lag and backpressure, and ensuring robust failure recovery.
  • Improve latency and reliability for workloads involving heavy data serving and querying.
  • Develop and maintain monitoring and alerting systems, including SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
  • Create and refine incident response protocols, on-call procedures, and postmortem practices.
  • Oversee backup, restore, and disaster recovery strategies, including regular drills.
  • Plan and execute safe rollouts across development, staging, and production environments, using canary deployments and rollback plans.
  • Work daily with software engineers to embed reliability into system design, implementation, and release cycles.
  • Set and promote standards for operational readiness and runbooks, encouraging adoption across teams.
  • Enhance CI/CD pipelines and improve the developer experience for greater speed and safety.

About OpenAI

OpenAI is at the forefront of artificial intelligence research and development. Our commitment to creating safe and beneficial AI technologies drives our innovative approaches and solutions. We empower researchers and engineers to push the boundaries of what is possible, fostering a collaborative environment that prioritizes ethical AI advancement.
