About the job
Thinking Machines Lab brings together scientists, engineers, and innovators who have shaped well-known AI products like ChatGPT and Character.ai, as well as open-weight models such as Mistral. The team also contributes to open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The company’s mission centers on advancing collaborative general intelligence, aiming to make AI accessible and adaptable to individual needs.
Tinker, the company’s fine-tuning API, enables researchers and developers to customize advanced AI models using their own data and algorithms. Thinking Machines manages the infrastructure, giving users the flexibility to train open-weight models while focusing on their unique requirements. As Tinker expands, the platform continues to evolve alongside its growing community.
Role overview
The Site Reliability Engineer will focus on improving the reliability and resilience of the Tinker platform. This role involves close collaboration with platform engineers and research teams to strengthen every layer of the system, from infrastructure to user-facing services.
What you will do
- Define and take ownership of end-to-end reliability, including CI/CD workflows, production observability, and incident response processes.
- Set Service Level Objectives for distributed training systems, balancing reliability, scheduling latency, and development speed.
- Design and implement monitoring and observability across the training pipeline.
- Manage incident response for Tinker, ensuring prompt recovery, thorough incident analysis, and systematic improvements to prevent recurrence.
- Enhance multi-tenant isolation and resource scheduling to support LoRA-based workload co-scheduling, maintaining both reliability and data separation.
- Collaborate with security teams to identify and address production vulnerabilities.
This position is based in San Francisco.

