About the job
About Our Team
At OpenAI, the Fleet team is integral to maintaining the robust computing environment that fuels our groundbreaking research and innovative product development. We manage extensive systems encompassing data centers, GPUs, networking, and more, ensuring optimal performance, availability, and efficiency. Our efforts empower OpenAI’s models to function seamlessly at scale, supporting both internal R&D and external offerings like ChatGPT. We emphasize safety, reliability, and responsible AI deployment over unrestrained growth.
About the Role
As a Software Engineer on the Fleet Hardware team, you will play a crucial role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for research training progress and service stability since even minor disruptions can lead to significant setbacks. With the increasing complexity of supercomputers, the pressure to maintain operational integrity has never been higher.
This is a unique opportunity to be at the forefront of technology, pioneering solutions for troubleshooting advanced systems on a large scale. You will work with cutting-edge technologies and innovate solutions to ensure the health and efficiency of our supercomputing infrastructure.
Our team empowers skilled engineers with a significant degree of autonomy and ownership, enabling them to drive impactful change. This role requires a keen focus on comprehensive system investigations and the development of automated solutions. We seek individuals who dive deep into problems, conduct thorough investigations, and create automation for large-scale detection and remediation.
In this Role, You Will:
- Design and maintain automation systems for provisioning and managing server fleets.
- Develop tools to monitor server health, performance, and lifecycle events.
- Collaborate with teams across clusters, networking, and infrastructure.
- Partner with external operators to uphold high-quality standards.
- Identify and resolve performance bottlenecks and inefficiencies.
- Continuously enhance automation to minimize manual tasks.

