About the job
About Our Team
The Scaling team at OpenAI is dedicated to designing, constructing, and managing essential infrastructure that powers groundbreaking research.
Our mission is straightforward: to expedite the advancement of research towards Artificial General Intelligence (AGI). We achieve this by developing foundational systems that researchers depend on, spanning from core infrastructure elements to specialized applications tailored for research. Our systems are designed to scale efficiently with the growing complexity and size of our workloads while ensuring reliability and user-friendliness.
About the Position
We are seeking a Senior Software Engineer to take charge of critical production infrastructure from start to finish.
This role primarily focuses on backend and systems engineering, with a strong emphasis on low-level performance, distributed systems, and the hands-on management of vital services at scale. You will be responsible for transforming ambiguous challenges into actionable plans, delivering pragmatic solutions promptly, and iterating on them based on real-world feedback.
This position goes beyond a standard Python backend role; we are specifically looking for candidates with strong systems experience in Rust or C++, particularly in performance-sensitive infrastructure.
This role is based in San Francisco, CA, and follows a hybrid model with three days in the office per week. We also provide relocation assistance for new hires.
Your Responsibilities
- Manage critical infrastructure throughout its lifecycle, including design, implementation, deployment, operation, and ongoing improvements.
- Develop and maintain high-performance backend systems in Rust or C++ that facilitate core research operations.
- Design and optimize distributed data and serving systems, considering partitioning, replication, consistency, retries, backpressure, and failure isolation.
- Identify and resolve production bottlenecks related to latency, throughput, contention, hot spots, and overload scenarios.
- Oversee mission-critical services, including on-call duties, incident management, postmortems, observability, deployment safety, and zero-downtime migrations.
- Enhance the reliability of services running on Kubernetes, focusing on resource tuning and failure management.
- Collaborate closely with engineers and researchers to deliver fast, dependable, and effective systems.
- Elevate standards through strong technical judgment, ownership, and commitment to quality.
You Will Excel in This Role If You Have:
- A proven track record of owning and delivering operationally critical systems end to end in ambiguous settings.
- Experience with systems programming in Rust or C++.
- Strong analytical skills and a problem-solving mindset.
- Excellent communication and collaboration skills.

