About the job
About Our Team
The Training Runtime team is responsible for developing the essential distributed runtime that supports everything from early-stage research projects to large-scale model training. Our focus is on creating robust, scalable, and high-performance components to enhance our distributed training capabilities. We aim to optimize the productivity of our researchers and hardware, contributing to the advancement of Artificial General Intelligence (AGI).
Within our team, the Process Management division creates the distributed operating system that orchestrates and monitors the multitude of processes involved in contemporary training tasks. Our runtime operates beneath various training frameworks and above research infrastructures, ensuring reliable job execution across expansive clusters while prioritizing performance, stability, and observability.
We measure our success through system reliability and the speed at which researchers can turn ideas into production training runs.
About the Position
As a Process Management Engineer specializing in Training Runtime, you will develop software that integrates thousands of computers into a cohesive system. This system must serve individual researchers running many parallel experiments as well as our largest training operations, which may involve hundreds of thousands or even millions of machines and accelerators. The systems you build therefore need to be user-friendly and introspectable, supporting quick debugging and development, while being continually optimized for scale without sacrificing stability or performance.
Your primary programming language will be Rust, which you will use to build high-performance asynchronous systems with a particular focus on accuracy and scalability.
Working at this scale, at the cutting edge of AI development, presents unique challenges. Conventional solutions may not be effective, and the problems you tackle will be complex, requiring strong design intuition and effective execution to improve our infrastructure.
We seek individuals passionate about optimizing end-to-end platforms and understanding high-performance architectures to maximize both local and distributed performance across our supercomputers. We are looking for engineers who thrive in a fast-paced environment and can respond to the dynamic and evolving requirements of our training runtime and computational framework.
This role is based in London, UK, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new hires.

