Senior ML Systems Engineer, Frameworks & Tooling

Cohere, London
On-site, Full-Time

Qualifications

Extensive experience in engineering large-scale distributed training systems or HPC infrastructure. Proficient in programming languages such as Java, Python, or C++. Experience with cloud computing platforms and container orchestration tools. Excellent analytical and problem-solving abilities with a collaborative approach.

About the job

Who are we?

At Cohere, our mission is to harness the power of artificial intelligence to enhance human capabilities. We are pioneers in developing and deploying advanced language models for developers and enterprises, enabling them to create transformative experiences such as content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. We are committed to driving the widespread adoption of AI technologies.

We take pride in our craftsmanship. Each team member plays a vital role in enhancing our models’ capabilities and delivering exceptional value to our clients. We thrive in a fast-paced environment where our focus is on what is best for our customers.

Cohere is made up of a diverse group of researchers, engineers, designers, and other professionals, all dedicated to their craft. We believe that a variety of perspectives is essential for creating remarkable products.

Join us on our journey and help shape the future of AI!

We are seeking a seasoned engineer to contribute to the design, maintenance, and evolution of the training framework that underpins our state-of-the-art language models. This role is positioned at the crossroads of large-scale training, distributed systems, and high-performance computing (HPC) infrastructure. You will be responsible for architecting and managing core components that facilitate rapid, reliable, and scalable model training, as well as developing tools that connect innovative research ideas to thousands of GPUs.

If you have a passion for working across the entire machine learning systems stack, this role offers you the opportunity and autonomy to make a substantial impact.

What You’ll Work On

  • Own and enhance the training framework for large-scale LLM training.

  • Design distributed training abstractions including data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, and checkpointing.

  • Optimize training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).

  • Develop and maintain tools for monitoring, logging, debugging, and improving developer experience.

  • Collaborate with infrastructure teams to ensure our clusters, container environments, and hardware configurations are optimized for high-performance training.

  • Analyze and troubleshoot performance bottlenecks throughout the machine learning systems stack.

  • Create robust systems that keep large-scale training runs reproducible and debuggable.

You Might Be a Great Fit If You Have

  • Extensive engineering experience in large-scale distributed training or HPC systems.

  • Deep familiarity with Java, Python, and/or C++ programming languages.

  • Experience with cloud platforms and container orchestration tools.

  • Strong problem-solving skills and a collaborative mindset.

About Cohere

Cohere is at the forefront of AI innovation, dedicated to developing advanced machine learning models that empower developers and enterprises. Our team is a blend of top-tier researchers, engineers, and designers who are all passionate about creating meaningful AI solutions. We value diversity and believe that different perspectives lead to exceptional products. Join us to be a part of a mission-driven company that is shaping the future of technology.
