About the job
About Our Team
The Training Runtime team builds the distributed machine learning training runtime that powers everything from early-stage research to large-scale model deployments. Our mission is to empower researchers and to support their growth into frontier-scale operations. We are building a cohesive, modular runtime that adapts to researchers’ evolving needs as they move along the scaling curve.
Our work is anchored in three key areas:
High-performance, asynchronous data movement that is aware of tensor and optimizer state.
Robust, fault-tolerant training frameworks with comprehensive state management, resilient checkpointing, deterministic orchestration, and advanced observability.
Distributed process management for long-running, job-specific, and user-defined workflows.
We aim to fold proven large-scale capabilities into a developer-friendly runtime so teams can iterate rapidly and operate reliably at any scale. We measure success by both training throughput (how fast models train) and researcher throughput (how quickly ideas become experiments and products).
About the Role
As a Training Performance Engineer, you will drive efficiency improvements across our distributed training stack. You will analyze large-scale training runs, pinpoint utilization gaps, and engineer optimizations that maximize throughput and system uptime. The role combines deep systems understanding with hands-on performance engineering: profiling GPU kernel performance and collective-communication throughput, investigating I/O bottlenecks, and implementing model-sharding techniques for large-scale training.
Your efforts will ensure our clusters operate at peak performance, enabling OpenAI to develop larger and more sophisticated models within existing compute budgets.
This position is based in San Francisco, CA, with a hybrid schedule of three days in the office each week; we offer relocation assistance to new hires.
Key Responsibilities:
Analyze end-to-end training runs to detect performance bottlenecks across computation, communication, and storage.
Enhance GPU utilization and throughput for large-scale distributed model training.
Collaborate with runtime and systems engineers to boost kernel efficiency, scheduling, and collective communication performance.
Implement model graph transformations to improve overall throughput.
Develop tools for monitoring and visualizing metrics such as MFU (Model FLOPs Utilization), throughput, and uptime across clusters.