About the Job
Become a vital part of the engineering teams that responsibly bring OpenAI’s transformative technologies to the world!
At OpenAI, our Applied Engineering team collaborates across research, engineering, product management, and design to deliver AI solutions to both consumers and businesses. We are committed to learning from our deployments, maximizing the benefits of AI, and ensuring that this powerful technology is utilized both safely and ethically. Our priority is safety over unchecked growth.
About the Role
As OpenAI continues to expand, we are seeking seasoned engineers who excel in problem-solving to enhance the scalability of our systems. Our achievements hinge on our ability to rapidly iterate on product development while ensuring optimal performance and reliability. You will thrive in a collaborative, fast-paced environment, playing a key role in delivering our technology to millions globally, with a focus on safety and reliability.
As a reliability engineer, you will lead efforts to maintain and improve the stability, scalability, and performance of our dynamic infrastructure. You will collaborate closely with cross-functional teams, including software engineers, product managers, and data scientists, to construct and sustain robust systems capable of accommodating our growing user base and workload.
Your Responsibilities Include:
Designing and implementing solutions to scale our infrastructure to meet increasing demands effectively.
Developing and maintaining load, chaos, and synthetic testing software that enhances the reliability of systems designed by development teams.
Creating and managing automation tools to streamline repetitive tasks and bolster system reliability.
Overseeing the lifecycle management platform for CPU/storage, GPU, and network resources to foster efficiency and support dynamic optimization.
Implementing fault-tolerant and resilient design patterns to minimize service interruptions.
Establishing and maintaining service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability.
Collaborating with researchers, engineers, product managers, and designers to introduce new features and research advancements to the world.
Participating in an on-call rotation to address critical incidents and ensure 24/7 system availability.
Your Impact: Your contributions will be essential in guaranteeing the reliability and performance of our platforms as we continue to scale our operations.

