About the job
About Our Team
The Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation.
We turn sophisticated data center designs into operational systems and develop the software infrastructure needed for large-scale frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.
About the Role
As a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes.
Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission. This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.
Your Responsibilities Include:
- Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.
- Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.
- Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.
You May Be a Great Fit If You Possess:
- 7+ years of hands-on experience in software engineering.
- Strong proficiency in Python and shell scripting.
- Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.
- Experience in creating reproducible analyses.
- A solid balance of skills in both building and operationalizing systems.
Prior experience with hardware is not a prerequisite for this role.
Preferred Qualifications:
- Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, InfiniBand, networking, power management, kernel performance tuning).
- Experience with system optimization and performance tuning.

