About the job
Who are we?
At Cohere, our mission is to harness the power of artificial intelligence to enhance human capabilities. We are pioneers in developing and deploying advanced language models for developers and enterprises, enabling them to create transformative experiences such as content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. We are committed to driving the widespread adoption of AI technologies.
We take pride in our craftsmanship. Each team member plays a vital role in enhancing our models’ capabilities and delivering exceptional value to our clients. We thrive in a fast-paced environment where our focus is on what is best for our customers.
Cohere is made up of a diverse group of researchers, engineers, designers, and other professionals, all dedicated to their craft. We believe that a variety of perspectives is essential for creating remarkable products.
Join us on our journey and help shape the future of AI!
We are seeking a seasoned engineer to contribute to the design, maintenance, and evolution of the training framework that underpins our state-of-the-art language models. This role is positioned at the crossroads of large-scale training, distributed systems, and high-performance computing (HPC) infrastructure. You will be responsible for architecting and managing core components that facilitate rapid, reliable, and scalable model training, as well as developing tools that connect innovative research ideas to thousands of GPUs.
If you have a passion for working across the entire machine learning systems stack, this role offers you the opportunity and autonomy to make a substantial impact.
What You’ll Work On
Own and evolve the framework that powers large-scale LLM training.
Design distributed training abstractions including data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, and checkpointing.
Optimize training throughput and stability on multi-node clusters (e.g., NVIDIA H100/H200 and GB200/GB300, AMD accelerators).
Develop and maintain tools for monitoring, logging, debugging, and improving developer experience.
Collaborate with infrastructure teams to ensure our clusters, container environments, and hardware configurations are optimized for high-performance training.
Analyze and troubleshoot performance bottlenecks throughout the machine learning systems stack.
Build robust systems that keep large-scale training runs reproducible and debuggable.
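To give a flavor of the distributed-training work above, here is a toy sketch of the data-parallel pattern: each worker computes gradients on its shard of a batch, gradients are averaged (the all-reduce step), and every worker applies the same update. All names are illustrative; production systems use NCCL and frameworks such as torch.distributed rather than plain Python.

```python
def local_grad(w, xs, ys):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 over one shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(grads):
    """Simulated all-reduce: average the per-worker gradients."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr=0.1):
    # In a real system each worker runs local_grad concurrently on its own GPU.
    grads = [local_grad(w, xs, ys) for xs, ys in shards]
    g = all_reduce_mean(grads)
    return w - lr * g

# With equal-size shards, the averaged gradient equals the full-batch
# gradient, so data parallelism reproduces single-worker training exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

w_parallel = data_parallel_step(1.0, shards)
w_single = data_parallel_step(1.0, [(xs, ys)])
assert abs(w_parallel - w_single) < 1e-12
```

Tensor and pipeline parallelism, FSDP/ZeRO sharding, and checkpointing all build on the same primitive: partition work or state across devices, then use collectives to keep the replicas consistent.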
You Might Be a Great Fit If You Have
Extensive engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with Python, C++, and/or Java.
Experience with cloud platforms and container orchestration tools.
Strong problem-solving skills and a collaborative mindset.