About Our Innovative Team
Join the Workload team at OpenAI, where we design and operate the infrastructure that drives the training and inference of large language models (LLMs) at unprecedented scale. Our systems coordinate model training and serving, abstracting away the details of performance, parallelism, and execution across large fleets of GPUs and accelerators. This foundation lets researchers concentrate on elevating model capabilities while we provide the scalability, efficiency, and reliability needed to bring these models to life.
Your Role and Responsibilities
We are seeking a talented engineer to design and implement the dataset infrastructure that will fuel OpenAI’s next-generation training stack. Your primary focus will be on creating standardized dataset interfaces, scaling pipelines across thousands of GPUs, and proactively identifying and addressing performance bottlenecks. Collaboration with multimodal researchers and infrastructure teams will be key to ensuring that our datasets are unified, efficient, and user-friendly.
Key Responsibilities Include:
Design and maintain standardized dataset APIs, including those for multimodal data that exceeds memory capacity.
Develop proactive testing and validation pipelines for dataset loading at GPU scale.
Work collaboratively to integrate datasets into training and inference pipelines, ensuring seamless user experiences.
Document and maintain dataset interfaces to ensure they are discoverable, consistent, and easily adoptable by other teams.
Establish validation systems to guarantee that datasets remain reproducible and unchanged once standardized.
Identify and troubleshoot performance bottlenecks in distributed dataset loading, such as stragglers that slow global training.
Create visualization and inspection tools to highlight errors, bugs, or bottlenecks in datasets.
Ideal Candidate Profile
Possess strong engineering fundamentals and experience in distributed systems, data pipelines, or infrastructure.
Have a proven track record in building APIs, modular code, and scalable abstractions, with a user-centric approach to design.
Be adept at debugging performance issues across large-scale machine fleets.
Demonstrate a passion for advancing data infrastructure to enhance research capabilities.