About the Job
At Thinking Machines Lab, we are committed to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and aspirations.
Our team comprises scientists, engineers, and builders who have developed some of the most widely used AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral. We also contribute to notable open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Role
We are seeking a talented Infrastructure Research Engineer to enhance, scale, and fortify the systems supporting Tinker. This role will enable our internal teams and external clients to fine-tune models seamlessly, reliably, and cost-effectively. You will work at the intersection of large-scale training systems and product infrastructure, creating multi-tenant scheduling, storage, observability, and reliability features within a developer-friendly API.
Your contributions will allow all Tinker users to concentrate on research and development without the burden of infrastructure concerns.
Note: This is an evergreen posting that we keep open for ongoing interest. We receive many applications, and there may not always be a role that matches your skills and experience. We still encourage you to apply, as we continuously review applications and will reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please do not apply more than once every 6 months. We also post specific roles for particular project or team needs, and you are welcome to apply directly to those in addition to this evergreen listing.
What You’ll Do
- Design and implement distributed job orchestration, placement, preemption, and fair-share scheduling to enhance Tinker for multi-tenant workloads.
- Optimize GPU utilization, throughput, and reliability across clusters (including autoscaling, bin-packing, and quotas).
- Develop reusable frameworks and libraries to enhance Tinker’s transparency, reproducibility, and performance.
- Collaborate with researchers and developer experience engineers to transform fine-tuning challenges into product features.
- Publish and disseminate insights through internal documentation, open-source libraries, or technical reports to advance the field of scalable AI infrastructure.

