About the job
At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating an inclusive future where everyone can harness AI capabilities tailored to their unique aspirations.
Our team comprises scientists, engineers, and innovators behind some of the most impactful AI solutions, including ChatGPT and Character.ai, as well as open-source projects like PyTorch and Segment Anything.
About the Role
We are seeking a talented Software Engineer to architect, develop, and maintain the GPU supercomputing infrastructure essential for large-scale AI training and inference. Your work will keep this compute high-performance, reliable, and cost-effective, enabling our researchers and users to make rapid progress at scale.
This is an "evergreen" role that remains open on an ongoing basis. We receive many applications, and an immediate fit may not always be available, but we encourage you to apply: we actively review applications and reach out when new opportunities arise. Reapplications are welcome after six months, and we also post specific roles for particular projects or teams.
What You’ll Do
- Automate and manage large GPU clusters, handling provisioning, imaging, and capacity strategy.
- Develop software that simplifies cluster management, providing a cohesive interface for training and inference tasks.
- Enhance scheduling and orchestration frameworks (Kubernetes, Slurm, or similar) for optimized resource allocation, preemption, and multi-tenancy management.
- Monitor and improve operational efficiency, focusing on speed, reliability, and error recovery mechanisms.
- Design robust storage solutions for datasets, checkpoints, and logs, ensuring clear data retention and lineage.
- Collaborate with researchers to facilitate large-scale experiments, offering guidance on parallelism and performance considerations.