
Infrastructure Research Engineer at Thinking Machines Lab | San Francisco

On-site Full-time $350K/yr - $475K/yr





Qualifications

Minimum qualifications:

  • Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, or a related field.
  • Familiarity with distributed systems and experience developing scalable infrastructure.
  • Strong programming skills in languages such as Python, Go, or similar.
  • Understanding of machine learning frameworks and GPU resource management.

About the job

At Thinking Machines Lab, we are committed to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and aspirations.

Our team comprises scientists, engineers, and builders who have developed some of the most utilized AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral. We also contribute to notable open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role

We are seeking a talented Infrastructure Research Engineer to enhance, scale, and fortify the systems supporting Tinker. This role will enable our internal teams and external clients to fine-tune models seamlessly, reliably, and cost-effectively. You will work at the intersection of large-scale training systems and product infrastructure, building multi-tenant scheduling, storage, observability, and reliability features behind a developer-friendly API.

Your contributions will allow all Tinker users to concentrate on research and development without the burden of infrastructure concerns.

Note: This is an evergreen position that we keep open for ongoing interest. We receive numerous applications, and there may not always be a role that aligns perfectly with your skills and experience. We encourage you to apply, as we continuously review applications and will reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please refrain from applying more than once every 6 months. We also post specific roles for unique project or team needs, and you are welcome to apply directly to those in addition to this evergreen listing.

What You’ll Do

  • Design and implement distributed job orchestration, placement, preemption, and fair-share scheduling to enhance Tinker for multi-tenant workloads.
  • Optimize GPU utilization, throughput, and reliability across clusters (including autoscaling, bin-packing, and quotas).
  • Develop reusable frameworks and libraries to enhance Tinker’s transparency, reproducibility, and performance.
  • Collaborate with researchers and developer experience engineers to transform fine-tuning challenges into product features.
  • Publish and disseminate insights through internal documentation, open-source libraries, or technical reports to advance the field of scalable AI infrastructure.
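To give a flavor of the scheduling problems this role touches, here is a minimal first-fit-decreasing bin-packing sketch for placing GPU job requests onto fixed-size nodes. All names are hypothetical illustrations of the general technique, not Tinker's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A GPU node with a fixed capacity and the jobs placed on it."""
    capacity: int
    jobs: list = field(default_factory=list)  # (job_id, gpus) pairs

    @property
    def free(self) -> int:
        return self.capacity - sum(gpus for _, gpus in self.jobs)


def first_fit_decreasing(requests, node_capacity=8):
    """Place (job_id, gpus) requests onto nodes, largest request first.

    Sorting big requests first is a classic heuristic that reduces
    fragmentation: small jobs later fill the gaps the big ones leave.
    """
    nodes = []
    for job_id, gpus in sorted(requests, key=lambda r: -r[1]):
        for node in nodes:
            if node.free >= gpus:           # first node with room wins
                node.jobs.append((job_id, gpus))
                break
        else:                               # no node fits: open a new one
            node = Node(capacity=node_capacity)
            node.jobs.append((job_id, gpus))
            nodes.append(node)
    return nodes


requests = [("a", 4), ("b", 2), ("c", 6), ("d", 3), ("e", 1)]
nodes = first_fit_decreasing(requests)  # 16 GPUs pack into two 8-GPU nodes
```

A production scheduler layers preemption, fair-share quotas, and autoscaling on top of a placement core like this, but the packing decision itself stays this simple in shape.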

About Thinking Machines Lab

Thinking Machines is a pioneering AI lab dedicated to the advancement of collaborative general intelligence. Our innovative team has produced some of the most utilized AI solutions globally, ensuring that technology serves humanity’s diverse needs.
