Training Performance Engineer

OpenAI, San Francisco
Hybrid, Full-time

Qualifications

Candidates should possess a strong background in computer science, software engineering, or a related field, with extensive experience in performance engineering within distributed systems. Proficiency in GPU architecture, parallel programming, and performance profiling tools is essential. Familiarity with machine learning concepts and distributed training frameworks will be highly advantageous. Strong analytical and problem-solving skills, alongside an ability to collaborate effectively with cross-functional teams, are crucial for success in this role.

About the job

About Our Team
The Training Runtime team is at the forefront of developing a cutting-edge distributed machine learning training runtime, enabling everything from pioneering research to large-scale model deployments. Our mission is to empower researchers while facilitating growth into frontier-scale operations. We are crafting a cohesive, modular runtime that adapts to researchers’ evolving needs as they progress along the scaling curve.

Our focus is anchored in three key areas: optimizing high-performance, asynchronous data movement that is aware of tensor and optimizer states; building robust, fault-tolerant training frameworks that incorporate comprehensive state management, resilient checkpointing, deterministic orchestration, and advanced observability; and managing distributed processes for enduring, job-specific, and user-defined workflows.

We aim to seamlessly integrate proven large-scale capabilities into a developer-friendly runtime, enabling teams to iterate rapidly and operate reliably across various scales. Our success is gauged by both the enhancement of training throughput (the speed of model training) and researcher throughput (the pace at which ideas transform into experiments and products).

About the Role
As a Training Performance Engineer, you will be instrumental in driving efficiency enhancements throughout our distributed training architecture. Your responsibilities will include analyzing extensive training runs, pinpointing utilization gaps, and engineering optimizations that maximize throughput and system uptime. This position combines a deep understanding of systems with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and implementing model sharding techniques for large-scale training.

Your efforts will ensure our clusters operate at peak performance, enabling OpenAI to develop larger and more sophisticated models within existing compute budgets.

This position is based in San Francisco, CA. We use a hybrid work model with three days in the office each week, and we offer relocation assistance for new hires.

Key Responsibilities:

  • Analyze end-to-end training runs to detect performance bottlenecks across computation, communication, and storage.

  • Enhance GPU utilization and throughput for large-scale distributed model training.

  • Collaborate with runtime and systems engineers to boost kernel efficiency, scheduling, and collective communication performance.

  • Implement model graph transformations to enhance overall throughput.

  • Develop tools for monitoring and visualizing metrics such as MFU (model FLOPs utilization), throughput, and uptime across clusters.

About OpenAI

OpenAI is a pioneering organization dedicated to advancing digital intelligence in a way that is safe and beneficial for humanity. We are committed to conducting cutting-edge research and developing innovative technologies that push the boundaries of what's possible in the field of artificial intelligence. Our diverse team of experts strives to create an inclusive environment where creativity and collaboration thrive, enabling us to tackle some of the most complex challenges in AI.
