companyOpenAI logo

Software Engineer in Compute Infrastructure

OpenAISan FranciscoNew
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

Qualifications:Proficiency in programming languages such as Python, C++, or Go. Experience with distributed systems and high-performance computing. Strong understanding of Kubernetes and orchestration tools. Ability to optimize systems for performance and reliability. Familiarity with hardware architecture and low-level systems programming. Excellent problem-solving skills and engineering judgment.

About the job

Team and Platform Focus

The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company.

Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.

Role Overview

This Software Engineer role centers on building and evolving the compute platform that supports OpenAI’s research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.

What You Will Work On

  • Working close to hardware or at the user interaction layer
  • Developing CaaS and agent infrastructure
  • Managing control and data planes that connect the system
  • Bringing new supercomputing capabilities online
  • Optimizing training workloads through profiler traces and benchmarks
  • Improving NCCL and collective communication
  • Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
  • Designing abstractions to unify diverse clusters into a single platform

Areas of Expertise

No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform’s usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.

About OpenAI

OpenAI is at the forefront of artificial intelligence research, dedicated to developing safe and beneficial AI technologies. Our innovative environment fosters collaboration and creativity, empowering engineers to work on groundbreaking projects that shape the future of AI.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.