companyOpenAI logo

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAISan Francisco
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

Strong background in distributed systems and infrastructure management, proficiency with Kubernetes, and experience in automating large-scale deployments. Familiarity with networking, monitoring tools, and system health management is essential. A proactive approach to problem-solving and the ability to work in high-pressure environments are crucial for success in this role.

About the job

About Our Team

The Frontier Systems team at OpenAI is at the forefront of technological innovation, responsible for designing, deploying, and maintaining state-of-the-art supercomputers that power our most advanced model training initiatives. We transform innovative data center designs into fully functional systems and develop the necessary software to support extensive frontier model training.

Our mission is to ensure the stability and efficiency of these hyperscale supercomputers, providing an uninterrupted environment for the training of frontier models.

About the Opportunity

We are seeking passionate engineers to manage the next generation of compute clusters that fuel OpenAI’s leading-edge research. This role merges distributed systems engineering with practical infrastructure expertise across our expansive data centers. You will be tasked with scaling Kubernetes clusters to unprecedented levels, automating bare-metal deployments, and creating software solutions that simplify interactions across a multitude of nodes in various data centers.

You will operate at the confluence of hardware and software, where speed and reliability are of utmost importance. Prepare to oversee dynamic operations, swiftly diagnose and resolve critical issues, and continuously enhance automation and system uptime.

Key Responsibilities:

  • Deploy and scale substantial Kubernetes clusters, implementing automation for provisioning, bootstrapping, and lifecycle management.

  • Create software abstractions that integrate multiple clusters, delivering a seamless interface for training workloads.

  • Oversee node deployment from bare metal to firmware upgrades, ensuring swift and repeatable processes at scale.

  • Enhance operational metrics, striving to minimize cluster restart times (e.g., reducing from hours to minutes) and expedite firmware or OS upgrades.

  • Integrate networking and hardware health systems to ensure comprehensive reliability across servers, switches, and data center infrastructure.

  • Develop monitoring and observability systems that proactively identify issues and maintain cluster stability under peak loads.

  • Be prepared to perform at the level of a software engineer in execution and problem-solving.

You May Be a Great Fit If You:

  • Possess extensive experience in operating or scaling Kubernetes clusters or similar container orchestration systems.

About OpenAI

OpenAI, a leader in artificial intelligence research and deployment, is dedicated to ensuring that artificial general intelligence benefits all of humanity. Our innovative team is committed to pushing the boundaries of what's possible with AI, and we are looking for talented engineers to join us in this mission.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.