
Infrastructure Research Engineer - Numerics at Thinking Machines | San Francisco

On-site | Full-time | $350K/yr–$475K/yr


Qualifications

We are looking for candidates with a strong background in systems engineering and numerical optimization. Ideal candidates will have experience with:

  • Large-scale machine learning model training
  • Distributed computing frameworks
  • Low-precision numerics and their application in machine learning
  • Collaboration with interdisciplinary teams

About the job

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to make AI work for their individual needs and goals. 

Our team comprises scientists, engineers, and innovators who have developed some of the most widely adopted AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral, as well as popular open-source initiatives such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a highly skilled infrastructure research engineer to architect and develop core systems that facilitate efficient large-scale model training, with a strong emphasis on numerics. You will enhance the numerical foundations of our distributed training stack, focusing on precision formats, kernel optimizations, and communication frameworks to ensure that training trillion-parameter models is stable, scalable, and fast.

This position is ideal for someone who excels at the intersection of research and systems engineering: a builder who understands both the mathematics of optimization and the practicalities of distributed computing.

Note: This is an "evergreen role" that remains open for ongoing expressions of interest. While we receive numerous applications and there may not always be an immediate opening that perfectly matches your skills and experience, we encourage you to apply. We continuously review applications and will contact applicants as new opportunities arise. You are welcome to reapply if you gain additional experience, but please refrain from applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams; in those instances, you are welcome to apply for those positions in addition to the evergreen role.

What You’ll Do

  • Design and optimize distributed training infrastructure for large-scale LLMs, ensuring performance, stability, and reproducibility in multi-GPU and multi-node environments.
  • Implement and assess low-precision numerics (e.g., BF16, MXFP8, NVFP4) to enhance efficiency while maintaining model quality.
  • Develop kernels and communication primitives that leverage hardware-level support for mixed and low-precision arithmetic.
  • Collaborate with research teams to co-design model architectures and training methodologies that align with new numeric formats and stability requirements.
  • Prototype and benchmark scaling strategies, including data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.
  • Contribute to the design of our internal orchestration and monitoring frameworks.
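To give a flavor of the low-precision numerics work described above, here is a minimal, self-contained Python sketch (purely illustrative, not part of our stack) of why mixed-precision training keeps accumulators in higher precision: a long sum carried entirely in bfloat16 stalls, while the same sum with a higher-precision accumulator does not.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float to bfloat16 precision via its float32 bit pattern.

    bfloat16 keeps float32's sign bit and 8 exponent bits but only the
    top 7 mantissa bits, trading precision for the same dynamic range.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Summing 10,000 small increments entirely in bf16 stalls once the running
# total's spacing (ulp) exceeds the addend; a wider accumulator does not.
step = to_bf16(0.001)
naive = 0.0
for _ in range(10000):
    naive = to_bf16(naive + step)        # bf16 accumulator: updates vanish
mixed = sum(step for _ in range(10000))  # higher-precision accumulator

print(naive, round(mixed, 3))  # naive stalls far below the true sum of ~10
```

The same reasoning is why production mixed-precision recipes compute matmuls in BF16 or FP8 but keep gradient accumulation and optimizer state in FP32.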

About Thinking Machines Lab

Thinking Machines is at the forefront of AI innovation, dedicated to creating tools and systems that harness the power of collaborative general intelligence. Our team is composed of top-tier professionals who contribute to pioneering AI technologies that are reshaping industries and enhancing user experiences worldwide.
