
Software Engineer, Data Infrastructure - Research

OpenAI · San Francisco
On-site · Full-time





Qualifications

Strong background in engineering with expertise in distributed systems and data infrastructure. Proven experience in designing APIs and creating scalable solutions. Ability to troubleshoot complex performance issues and a commitment to improving user experience in data handling.

About the job

About Our Innovative Team

Join the Workload team at OpenAI, where we design and manage the infrastructure that drives the training and inference of large language models (LLMs) at unprecedented scale. Our systems unify model training and serving, abstracting away performance, parallelism, and execution details across large fleets of GPUs and accelerators. This foundation lets researchers concentrate on advancing model capabilities while we handle the scalability, efficiency, and reliability needed to bring these models to life.

Your Role and Responsibilities

We are seeking a talented engineer to design and implement the dataset infrastructure that will fuel OpenAI’s next-generation training stack. Your primary focus will be on creating standardized dataset interfaces, scaling pipelines across thousands of GPUs, and proactively identifying and addressing performance bottlenecks. Collaboration with multimodal researchers and infrastructure teams will be key to ensuring that our datasets are unified, efficient, and user-friendly.

Key Responsibilities Include:

  • Design and maintain standardized dataset APIs, including those for multimodal (MM) data that exceeds memory capacity.

  • Develop proactive testing and validation pipelines for dataset loading at GPU scale.

  • Work collaboratively to integrate datasets into training and inference pipelines, ensuring seamless user experiences.

  • Document and maintain dataset interfaces to ensure they are discoverable, consistent, and easily adoptable by other teams.

  • Establish validation systems to ensure datasets remain reproducible and unchanged once standardized.

  • Identify and troubleshoot performance bottlenecks in distributed dataset loading, such as stragglers that slow global training throughput.

  • Create visualization and inspection tools to highlight errors, bugs, or bottlenecks in datasets.

Ideal Candidate Profile

  • Possess strong engineering fundamentals and experience in distributed systems, data pipelines, or infrastructure.

  • Have a proven track record in building APIs, modular code, and scalable abstractions, with a user-centric approach to design.

  • Be adept at debugging performance issues across large-scale machine fleets.

  • Demonstrate a passion for advancing data infrastructure to enhance research capabilities.

About OpenAI

At OpenAI, we are dedicated to pioneering advancements in artificial intelligence. Our mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. With a team of world-class researchers and engineers, we are pushing the boundaries of AI technology to solve some of the most pressing challenges facing society today.
