About the job
OpenAI's research infrastructure group builds and maintains the backbone systems for advanced machine learning model training. The team regularly goes beyond conventional training setups, developing new infrastructure to support novel research at scale. Its work closely couples systems engineering with research progress, making it possible to run experiments that would otherwise be too slow or too complex to attempt.
Role overview
The Research Infrastructure Engineer for Training Systems designs and improves the platforms that power large-scale ML training. This role bridges research concepts and the practical systems that make large model training possible. The work has a direct impact on model release timelines and requires building systems that perform reliably in demanding, real-world scenarios.
What you will do
- Build and maintain infrastructure for large-scale model training and experimentation
- Design APIs and interfaces to simplify complex training workflows and prevent misuse
- Enhance reliability, debuggability, and performance across training and data pipelines
- Troubleshoot issues involving Python, PyTorch, distributed systems, GPUs, networking, and storage
- Create tests, benchmarks, and diagnostic tools to catch regressions early
Requirements
- Interest in building systems that support new training methods, not just optimizing existing ones
- Strong instincts in systems engineering, especially regarding performance, reliability, and clean abstractions
- Experience designing APIs and interfaces for researchers and engineers
- Ability to work across ML research code and production infrastructure
- Enthusiasm for evidence-based debugging using profiles, traces, logs, tests, and reproducible cases

