Software Engineer - Supercomputing at Thinking Machines Lab | San Francisco

On-site | Full-time | $350K/yr - $475K/yr

Experience Level

Entry Level

Qualifications

Minimum qualifications:
  • Bachelor’s degree or equivalent experience in computer science, engineering, or a related field.
  • Proficiency in at least one backend programming language (our stack includes Python and Rust).
  • Experience managing large-scale clusters and container orchestration tools (e.g., Kubernetes or Slurm).
  • Comfortable working across the technology stack and leading projects from start to finish.

About the job

At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating an inclusive future where everyone can harness AI's capabilities tailored to their unique aspirations.

Our team comprises scientists, engineers, and innovators behind some of the most impactful AI solutions, including ChatGPT and Character.ai, as well as open-source projects like PyTorch and Segment Anything.

About the Role

We are seeking a talented Software Engineer to architect, develop, and maintain the GPU supercomputing infrastructure essential for large-scale AI training and inference. Your contributions will ensure high-performance, reliable, and cost-effective computing resources, enabling our users and researchers to achieve rapid advancements at scale.

This is an "evergreen role," open for ongoing interest. We receive numerous applications, and while an immediate fit may not always be available, we encourage you to apply. We actively review applications and reach out when new opportunities arise. Reapplications are welcome after six months, and we also post specific roles for unique projects or teams.

What You’ll Do

  • Automate and manage large GPU clusters, handling provisioning, imaging, and capacity strategy.
  • Develop software that simplifies cluster management, providing a cohesive interface for training and inference tasks.
  • Enhance scheduling and orchestration frameworks (Kubernetes, Slurm, or similar) for optimized resource allocation, preemption, and multi-tenancy management.
  • Monitor and improve operational efficiency, focusing on speed, reliability, and error recovery mechanisms.
  • Design robust storage solutions for datasets, checkpoints, and logs, ensuring clear data retention and lineage.
  • Collaborate with researchers to facilitate large-scale experiments, offering guidance on parallelism and performance considerations.

About Thinking Machines Lab

Thinking Machines Lab is at the forefront of AI innovation, committed to empowering individuals through advanced collaborative intelligence. We believe in democratizing access to AI tools, making them accessible for diverse needs and goals.
