About the job
About Our Team
Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available—including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure—you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.
About the Role
As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely used AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to serve millions of users effectively.
This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.
Key Responsibilities
Design, build, and maintain reliable, high-performance systems used across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.
Ideal Candidate Profile
Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.

