About the Role
At Generalist, we train large robot foundation models on cutting-edge GPU hardware, primarily from Nvidia, for distributed training runs and experimental research. These workloads demand high-performance storage and optimized data loading, making full use of cloud infrastructure alongside custom-built solutions.
In this role, you will own our inference infrastructure: a dedicated fleet of on-premises GPUs that serves our robots' real-time, latency-sensitive workloads in resource-constrained environments.
Your Responsibilities:
Manage and optimize our GPU compute fleets.
Give researchers easy, self-serve access to GPUs while keeping utilization high.
Improve ML data loading, transport, and storage systems in heavily used distributed environments.
Oversee the orchestration of our robot inference fleets.
You May Excel in This Position If You:
Have experience managing large GPU fleets for distributed training or inference.
Have deep expertise with Slurm or Kubernetes for orchestrating ML workloads.
Have built high-throughput ML data loaders and data preparation systems.
Understand the intricacies of ML hardware, storage, and networking systems.
Are familiar with the Nvidia GPU ecosystem.

