companyOpenAI logo

Software Engineer, Platform Systems

OpenAILondon, UK
On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.


Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Mid to Senior

Qualifications

Proficiency in distributed systems design and architecture, strong programming skills (Python, C++, or similar), experience in performance analysis and debugging, familiarity with machine learning frameworks, and a passion for AI-driven technologies.

About the job

About the Team

At OpenAI, our Platform Systems team is at the forefront of innovation, merging advanced AI technologies with large-scale distributed systems. We create the engineering and research infrastructure essential for training OpenAI’s premier models on some of the world's most powerful, custom-designed supercomputers.

Our dedicated team develops critical model training software, delving deep into the technical stack to enhance collective communication, optimize compute efficiency, refine parallelism strategies, and improve fault tolerance, failure detection, and observability. The systems we engineer are pivotal to accelerating OpenAI’s research, ensuring reliable and efficient training at the cutting edge of technology.

We foster close collaboration with researchers throughout the organization, consistently integrating insights from across OpenAI to advance our training platform.

About the Role

As a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that enhance visibility into large-scale training workloads, ensuring their reliable operation at scale.

Your responsibilities will include engineering failure detection, tracing, and observability systems to identify suboptimal or malfunctioning nodes, revealing performance bottlenecks, and enabling engineers to optimize extensive distributed training jobs. This infrastructure is vital for the functionality of OpenAI’s training stack and is continuously evolving to accommodate new use cases and increasingly intricate workloads.

This position is central to our training infrastructure, merging systems engineering, performance analysis, and large-scale debugging.

In This Role, You Will

  • Design and implement distributed systems for failure detection, tracing, and profiling large-scale AI training tasks.

  • Create tools to detect slow, faulty, or erratic nodes, providing actionable insights into system performance.

  • Enhance observability, reliability, and performance across OpenAI’s training framework.

  • Diagnose and resolve challenges in complex, high-throughput distributed environments.

  • Collaborate with systems, infrastructure, and research teams to elevate platform capabilities.

  • Adapt and extend failure detection and tracing systems in support of new training methodologies and workloads.

You Might Thrive in This Role If You

  • Possess a strong passion for performance, stability, and observability in distributed systems.

  • Have experience with large-scale systems and are excited to tackle complex challenges in AI infrastructure.

About OpenAI

OpenAI is a pioneer in the AI landscape, committed to developing safe and beneficial AI technologies. Our team is composed of passionate experts dedicated to pushing the boundaries of innovation in artificial intelligence. We strive to create a collaborative environment where creativity and technical excellence thrive, ensuring that our developments contribute positively to society.

Similar jobs

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.