companyOpenAI logo

Research Infrastructure Engineer for Training Systems

OpenAISan FranciscoNew
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

Candidates should ideally possess a Bachelor's Degree in Computer Science or a related field, and demonstrate relevant experience in systems engineering or machine learning infrastructure.

About the job

OpenAI's research infrastructure group creates and maintains the backbone systems for advanced machine learning model training. This team often goes beyond conventional training methods, developing new infrastructure to support novel research at scale. Their work closely connects systems engineering with research progress, making it possible to run experiments that would otherwise be too slow or complex.

Role overview

The Research Infrastructure Engineer for Training Systems designs and improves the platforms that power large-scale ML training. This role bridges research concepts and the practical systems that make large model training possible. The work has a direct impact on model release timelines and requires building systems that perform reliably in demanding, real-world scenarios.

What you will do

  • Build and maintain infrastructure for large-scale model training and experimentation
  • Design APIs and interfaces to simplify complex training workflows and prevent misuse
  • Enhance reliability, debuggability, and performance across training and data pipelines
  • Troubleshoot issues involving Python, PyTorch, distributed systems, GPUs, networking, and storage
  • Create tests, benchmarks, and diagnostic tools to catch regressions early

Requirements

  • Interest in building systems that support new training methods, not just optimizing existing ones
  • Strong instincts in systems engineering, especially regarding performance, reliability, and clean abstractions
  • Experience designing APIs and interfaces for researchers and engineers
  • Ability to work across ML research code and production infrastructure
  • Enjoys evidence-based debugging using profiles, traces, logs, tests, and reproducible cases

About OpenAI

OpenAI is at the forefront of AI research and applications, dedicated to making artificial general intelligence accessible and beneficial to everyone. Our mission involves advancing AI technologies while ensuring their safety and ethical deployment.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.