Site Reliability Engineer at Thinking Machines | San Francisco

Thinking Machines Lab | San Francisco
On-site | Full-time | $350K/yr - $475K/yr

Qualifications

Minimum Qualifications:

  • Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
  • Proven experience in distributed systems, cloud infrastructure, or site reliability engineering.
  • Strong software development skills geared towards solving reliability challenges, including tooling and automation.
  • Experience managing production incident response, conducting postmortems, and driving system improvements.

About the job

Thinking Machines Lab brings together scientists, engineers, and innovators who have shaped well-known AI products like ChatGPT and Character.ai, as well as open-weight models such as Mistral. The team also contributes to open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The company’s mission centers on advancing collaborative general intelligence, aiming to make AI accessible and adaptable to individual needs.

Tinker, the company’s fine-tuning API, enables researchers and developers to customize advanced AI models using their own data and algorithms. Thinking Machines manages the infrastructure, giving users the flexibility to train open-weight models while focusing on their unique requirements. As Tinker expands, the platform continues to evolve alongside its growing community.

Role overview

The Site Reliability Engineer will focus on improving the reliability and resilience of the Tinker platform. This role involves close collaboration with platform engineers and research teams to strengthen every layer of the system, from infrastructure to user-facing services.

What you will do

  • Define and take ownership of end-to-end reliability, including CI/CD workflows, production observability, and incident response processes.
  • Set Service Level Objectives for distributed training systems, balancing reliability, scheduling latency, and development speed.
  • Design and implement monitoring and observability across the training pipeline.
  • Manage incident response for Tinker, ensuring prompt recovery, thorough incident analysis, and systematic improvements to prevent recurrence.
  • Enhance multi-tenant isolation and resource scheduling to support LoRA-based workload co-scheduling, maintaining both reliability and data separation.
  • Collaborate with security teams to identify and address production vulnerabilities.

This position is based in San Francisco.

About Thinking Machines Lab

Thinking Machines Lab is at the forefront of AI innovation, committed to empowering individuals and organizations through advanced collaborative intelligence. Our products and contributions are shaping the future of AI, making it more accessible and customizable for diverse applications.
