About Our Team
At OpenAI, our Hardware team is at the forefront of developing cutting-edge silicon and comprehensive system solutions tailored to the specific needs of advanced AI workloads. We pride ourselves on crafting the next generation of AI-native silicon, collaborating closely with software engineers and research teams to ensure our hardware integrates seamlessly with AI models. Our mission extends beyond creating production-grade silicon for OpenAI’s supercomputing infrastructure; we also build custom design tools and methodologies that enable hardware optimized specifically for AI.
About the Role
As a Software Engineer on the Scaling team, you will play a pivotal role in designing and optimizing the foundational stack that manages computation and data flow across OpenAI’s supercomputing clusters. Your responsibilities will include crafting high-performance runtimes, developing custom kernels, enhancing compiler infrastructure, and building scalable simulation systems to validate and optimize distributed training workloads.
This position sits at the intersection of systems programming, machine learning infrastructure, and high-performance computing, where you will create intuitive developer APIs alongside highly efficient runtime systems. You will balance usability and introspection against the need for stability and performance across our evolving hardware landscape.
This role is based in San Francisco, CA, featuring a hybrid work model (three days in-office per week). Relocation assistance is provided.
Key Responsibilities:
Design and implement APIs and runtime components to efficiently manage computation and data movement for diverse ML workloads.
Enhance compiler infrastructure by developing optimizations and compiler passes to accommodate evolving hardware advancements.
Engineer and refine compute and data kernels, ensuring precision, high performance, and compatibility across simulation and production settings.
Analyze and optimize system bottlenecks, focusing on I/O, memory hierarchy, and interconnects at both local and distributed scales.
Create simulation infrastructure to validate runtime behaviors, test modifications to the training stack, and support the early development of hardware and systems.
Quickly deploy updates to runtime and compiler across new supercomputing builds in close collaboration with hardware and research teams.
Work across a varied tech stack, primarily utilizing Rust and Python, with a chance to influence architectural decisions within the training framework.