Thinking Machines Lab

Infrastructure Research Engineer - Training Systems

On-site · Full-time · $350K/yr - $475K/yr

Qualifications

Minimum Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Experience with distributed systems and high-performance computing.
  • Familiarity with machine learning frameworks and tools.
  • Strong programming skills in languages such as Python, C++, or similar.
  • Ability to work collaboratively in a team environment.

About the job

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and goals.

Our team comprises scientists, engineers, and builders who have developed some of the most widely utilized AI products, such as ChatGPT and Character.ai, alongside open-weight models like Mistral, and popular open-source initiatives like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Position

We are seeking an Infrastructure Research Engineer to design and build the foundational systems that enable scalable, efficient training of large models for both deployment and research. Your primary objective will be to streamline experimentation and training at Thinking Machines so that our research teams can focus on scientific advances rather than system limitations.

This role is a great fit for someone who combines deep systems expertise with a keen interest in machine learning at scale. You will take full ownership of the training stack, ensuring that every GPU cycle contributes to scientific progress.

Note: This is an evergreen role that we keep open continuously so candidates can express interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but please avoid applying more than once every six months. We may also post specific roles for individual projects or team needs, in which case you are welcome to apply directly alongside this evergreen role.

Key Responsibilities

  • Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes.
  • Develop high-performance optimizations to maximize throughput and efficiency.
  • Create reusable frameworks and libraries that enhance training reproducibility, reliability, and scalability for new model architectures.
  • Establish standards for reliability, maintainability, and security, ensuring systems remain robust under rapid iterations.
  • Collaborate with researchers and engineers to construct scalable infrastructure.
  • Publish and disseminate findings through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.

About Thinking Machines Lab

Thinking Machines is at the forefront of AI innovation, dedicated to empowering individuals and organizations by providing cutting-edge tools and knowledge to leverage artificial intelligence for diverse applications. Join us in our mission to create a more informed and capable world.
