About the job
Our Mission
At Reflection AI, our goal is to develop open superintelligence and make it universally accessible.
We are pioneering open-weight models tailored for individuals, agents, enterprises, and even entire nations. Our diverse team comprises talented AI researchers and industry veterans from organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic, and many more.
Role Overview
Construct and enhance distributed training systems that drive the pre-training of cutting-edge models.
Collaborate with research teams to design and execute extensive training runs for foundational models.
Create infrastructure that enables efficient training across thousands of GPUs using contemporary distributed training frameworks.
Enhance training throughput, stability, and efficiency for extensive model training tasks.
Work closely with pre-training researchers to convert experimental concepts into scalable, production-ready training systems.
Boost performance of distributed training jobs by optimizing communication, memory management, and GPU utilization.
Develop and maintain training pipelines that accommodate large-scale datasets, checkpointing, and iterative experiments.
Identify and resolve performance bottlenecks across distributed training systems, including model parallelism, GPU communication, and training runtimes.
Contribute to systems that enable rapid experimentation and iteration on novel training methods.