About the job
Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will include:
Developing user-friendly scheduling and quota systems to maximize GPU utilization.
Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.
Building service frameworks and deployment systems that support diverse research workflows.
Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.
And much more!
About the Role
As a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines.
This role is based in San Francisco, CA. We embrace a hybrid work model, with three days in the office each week, and offer relocation assistance to new hires.
In this role, you will:
Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.
Collaborate closely with research and product teams to understand and meet workload requirements effectively.
Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.

