
Software Engineer, Fleet Infrastructure

OpenAI, San Francisco
Hybrid, Full-time

Qualifications

You might thrive in this role if you:

  • Have a solid background in hyperscale compute systems.

  • Possess strong programming skills in relevant languages.

  • Have experience working within public cloud environments, particularly Azure.

  • Are familiar with Kubernetes and its operational intricacies.

  • Exhibit an execution-focused mindset, complemented by rigorous attention to user requirements.

  • Bonus: Have a foundational understanding of AI/ML workloads.

About the job

Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will include:

  • Developing user-friendly scheduling and quota systems to maximize GPU utilization.

  • Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.

  • Building service frameworks and deployment systems that support diverse research workflows.

  • Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.

  • And much more!

About the Role

As a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines.

This role is based in San Francisco, CA. We embrace a hybrid work model, encouraging three days in the office each week, and offer relocation assistance for new hires.

In this role, you will:

  • Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.

  • Collaborate closely with research and product teams to understand and meet workload requirements effectively.

  • Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.

About OpenAI

At OpenAI, we are committed to developing artificial intelligence that benefits humanity. Our innovative team is at the forefront of AI research and application, striving to create powerful technologies responsibly. By joining us, you will be part of a vibrant community driving meaningful advancements in AI.
