Thinking Machines Lab

Infrastructure Research Engineer - Reinforcement Learning Systems

On-site | Full-time | $350K/yr - $475K/yr

Qualifications

To excel in this role, candidates should possess a solid understanding of reinforcement learning algorithms and practical experience with large-scale distributed systems. A background in machine learning infrastructure, software engineering, or a related field is beneficial.

About the job

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We're dedicated to crafting a future where everyone can harness the power of AI to meet their unique needs and aspirations.

Our team comprises scientists, engineers, and innovators who have built some of the most widely used AI products, including ChatGPT and Character.ai, open-weight models like Mistral, and well-known open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a talented Infrastructure Research Engineer to architect and develop the foundational systems that facilitate the scalable and efficient training of large models using reinforcement learning.

This position sits at the intersection of research and large-scale systems engineering. It calls for someone who understands the algorithms behind reinforcement learning and also the practicalities of distributed training and inference at scale. Your responsibilities will range from optimizing rollout and reward pipelines to improving the reliability, observability, and orchestration of systems. You will collaborate with researchers and infrastructure teams to ensure reinforcement learning is stable, fast, and production-ready.

Note: This is an evergreen role that we keep open on an ongoing basis so that candidates can express interest. Because we receive a high volume of applications, there may not always be an immediate position that matches your skills and experience. We encourage you to apply anyway: we review applications continuously and reach out to candidates when new opportunities arise. You may reapply after gaining more experience, but please do not apply more than once every six months. You may also see postings for specific roles tied to particular project or team needs; in those cases, you are welcome to apply to them directly in addition to this evergreen role.

What You’ll Do

  • Design, implement, and optimize the infrastructure that supports large-scale reinforcement learning and post-training workloads.
  • Enhance the reliability and scalability of the RL training pipeline, including distributed RL workloads, and improve training throughput.
  • Create shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility of RL systems.
  • Work closely with researchers to translate algorithmic concepts into production-quality training pipelines.
  • Develop evaluation and benchmarking infrastructure to assess model performance on helpfulness, safety, and factual accuracy.
  • Publish and disseminate insights through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.

About Thinking Machines Lab

Thinking Machines is at the forefront of AI innovation, committed to creating solutions that enable individuals and organizations to leverage AI effectively. Our groundbreaking work has shaped the landscape of AI products and frameworks, driving advancements in technology and knowledge sharing.
