companyOpenAI logo

Software Engineer, Frontier Systems

OpenAISan Francisco
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Mid to Senior

Qualifications

7+ years of experience in software engineering; proficiency in Python and shell scripting; comfort with SQL, PromQL, and data analysis tools; experience in reproducible analyses; strong skills in building and operationalizing systems.

About the job

About Our Team

The Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation.

We transform sophisticated data center designs into operational systems and develop the software infrastructure necessary for extensive frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.

About the Role

As a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes.

Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission. This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.

Your Responsibilities Include:

  • Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.
  • Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.
  • Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.

You May Be a Great Fit If You Possess:

  • 7+ years of hands-on experience in software engineering.
  • Strong proficiency in Python and shell scripting.
  • Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.
  • Experience in creating reproducible analyses.
  • A solid balance of skills in both building and operationalizing systems.

Prior experience with hardware is not a prerequisite for this role.

Preferred Qualifications:

  • Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, Infiniband, networking, power management, kernel performance tuning).
  • Experience with system optimization and performance tuning.

About OpenAI

OpenAI is committed to advancing artificial intelligence in a safe and beneficial manner. Our Frontier Systems team is dedicated to developing the most powerful supercomputers to support cutting-edge AI research, driving innovation and excellence in technology.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.