About the job
About Our Team
Join the Fleet team at OpenAI, where we empower groundbreaking research and product innovation through our advanced computing infrastructure. We manage extensive systems across data centers, GPUs, and networking, ensuring optimal performance, high availability, and efficiency. Our work is crucial in enabling OpenAI’s models to function seamlessly at scale, supporting both our internal research endeavors and external products like ChatGPT. We are committed to prioritizing safety, reliability, and the ethical deployment of AI technology.
About the Role
As a Software Engineer on the Fleet High Performance Computing (HPC) team, you will play a vital role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for smooth research training progress and uninterrupted services, as even minor hardware issues can lead to significant setbacks. With the rise of large supercomputers, the stakes in maintaining efficiency and stability have never been higher.
At the cutting edge of technology, we often lead the charge in troubleshooting complex, state-of-the-art systems at scale. This is a unique opportunity for you to engage with groundbreaking technologies and create innovative solutions that enhance the health and efficiency of our supercomputing infrastructure.
Our team fosters a culture of autonomy and ownership, enabling skilled engineers to drive meaningful change. In this role, you will focus on comprehensive system investigations and develop automated solutions to enhance our operations. We seek individuals who dive deep into challenges, conduct thorough investigations, and create scalable automation for detection and remediation.
Key Responsibilities:
Develop and maintain automation systems for provisioning and managing server fleets.
Create tools to monitor server health, performance metrics, and lifecycle events.
Collaborate effectively with teams across clusters, networking, and infrastructure.
Work closely with external operators to maintain a high level of service quality.
Identify and resolve performance bottlenecks and inefficiencies in the system.
Continuously enhance automation processes to minimize manual intervention.
You Will Excel in This Role if You Have:
Experience in managing large-scale server environments.
A blend of technical skills in systems programming and infrastructure management.
Strong problem-solving abilities and a methodical approach to troubleshooting.
Familiarity with high-performance computing technologies and tools.

