Software Engineer, GPU Infrastructure - HPC

OpenAI, San Francisco
On-site, Full-time

Qualifications

Experience in managing large-scale server environments; strong systems programming skills; expertise in high-performance computing technologies; excellent problem-solving capabilities and a methodical approach to troubleshooting.

About the job

About Our Team

Join the Fleet team at OpenAI, where we empower groundbreaking research and product innovation through our advanced computing infrastructure. We manage extensive systems across data centers, GPUs, and networking, ensuring optimal performance, high availability, and efficiency. Our work is crucial in enabling OpenAI’s models to function seamlessly at scale, supporting both our internal research endeavors and external products like ChatGPT. We are committed to prioritizing safety, reliability, and the ethical deployment of AI technology.

About the Role

As a Software Engineer on the Fleet High Performance Computing (HPC) team, you will play a vital role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential to keeping research training runs on track and services uninterrupted, as even minor hardware issues can lead to significant setbacks. With the rise of large supercomputers, the stakes in maintaining efficiency and stability have never been higher.

At the cutting edge of technology, we often lead the charge in troubleshooting complex, state-of-the-art systems at scale. This is a unique opportunity for you to engage with groundbreaking technologies and create innovative solutions that enhance the health and efficiency of our supercomputing infrastructure.

Our team fosters a culture of autonomy and ownership, enabling skilled engineers to drive meaningful change. In this role, you will focus on comprehensive system investigations and develop automated solutions to enhance our operations. We seek individuals who dive deep into challenges, conduct thorough investigations, and create scalable automation for detection and remediation.

Key Responsibilities:

  • Develop and maintain automation systems for provisioning and managing server fleets.

  • Create tools to monitor server health, performance metrics, and lifecycle events.

  • Collaborate effectively with teams across clusters, networking, and infrastructure.

  • Work closely with external operators to maintain a high level of service quality.

  • Identify and resolve performance bottlenecks and inefficiencies in the system.

  • Continuously enhance automation processes to minimize manual intervention.

You Will Excel in This Role if You Have:

  • Experience in managing large-scale server environments.

  • A blend of technical skills in systems programming and infrastructure management.

  • Strong problem-solving abilities and a methodical approach to troubleshooting.

  • Familiarity with high-performance computing technologies and tools.

About OpenAI

OpenAI is at the forefront of AI research and product development, dedicated to ensuring that artificial intelligence benefits all of humanity. Our Fleet team plays a pivotal role in supporting our innovative computing environment, enabling us to push the boundaries of what's possible with AI technology.
