companyOpenAI logo

Software Engineer, Fleet Hardware Health

OpenAISan Francisco
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

You Might Thrive in This Role If You Have:Proven experience managing large-scale server environments. A solid blend of skills in both building and operating complex systems. Strong problem-solving abilities and a passion for innovation. Experience with automation tools and scripting languages. Familiarity with monitoring and performance management tools.

About the job

About Our Team

At OpenAI, the Fleet team is integral to maintaining the robust computing environment that fuels our groundbreaking research and innovative product development. We manage extensive systems encompassing data centers, GPUs, networking, and more, ensuring optimal performance, availability, and efficiency. Our efforts empower OpenAI’s models to function seamlessly at scale, supporting both internal R&D and external offerings like ChatGPT. We emphasize safety, reliability, and responsible AI deployment over unrestrained growth.

About the Role

As a Software Engineer on the Fleet Hardware team, you will play a crucial role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for research training progress and service stability since even minor disruptions can lead to significant setbacks. With the increasing complexity of supercomputers, the pressure to maintain operational integrity has never been higher.

This is a unique opportunity to be at the forefront of technology, pioneering solutions for troubleshooting advanced systems on a large scale. You will work with cutting-edge technologies and innovate solutions to ensure the health and efficiency of our supercomputing infrastructure.

Our team empowers skilled engineers with a significant degree of autonomy and ownership, enabling them to drive impactful change. This role requires a keen focus on comprehensive system investigations and the development of automated solutions. We seek individuals who dive deep into problems, conduct thorough investigations, and create automation for large-scale detection and remediation.

In this Role, You Will:

  • Design and maintain automation systems for provisioning and managing server fleets.
  • Develop tools to monitor server health, performance, and lifecycle events.
  • Collaborate with teams across clusters, networking, and infrastructure.
  • Partner with external operators to uphold high-quality standards.
  • Identify and resolve performance bottlenecks and inefficiencies.
  • Continuously enhance automation to minimize manual tasks.

About OpenAI

OpenAI is at the forefront of AI research and development, dedicated to creating safe and beneficial AI technologies. Our commitment to innovation drives us to explore new frontiers in artificial intelligence, and we are proud to support a diverse team of talented individuals who share our vision for a better future powered by AI.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.