About the job
At Rhoda AI, we are building a comprehensive foundation for the next generation of humanoid robots, spanning from high-performance, software-defined hardware to advanced foundation models and video world models that govern robot behavior. Our robots are engineered to be versatile: capable of navigating complex, real-world environments and handling scenarios never encountered in training. We sit at the intersection of large-scale learning, robotics, and systems, backed by a research team that includes experts from Stanford, Berkeley, and Harvard. Our ambition is not merely to add features; we are building a new computing platform for physical tasks, supported by over $400 million in funding that fuels aggressive investment in research and development, hardware innovation, and scaled-up manufacturing.
Role Overview
We are looking for a Principal Machine Learning Systems Engineer to own our training systems' performance end to end. You will define how our model training scales, improving efficiency, scalability, and accuracy across large multimodal training workloads. This is a pivotal systems role, not merely infrastructure support: your work will directly determine our compute utilization, how well our models scale across thousands of GPUs, and how fast research iterates.
Your Responsibilities
Own training performance end to end
- Analyze and enhance the performance of large-scale multimodal training encompassing vision, video, proprioception, actions, and language.
- Build systematic performance attribution: break step time into compute, communication, and input-pipeline components, measure scaling curves across cluster sizes, and identify the key bottlenecks.
- Drive quantifiable improvements across:
  - Distributed efficiency (e.g., communication and compute overlap, bucketization, topology-aware mapping, and parallelism strategies).
  - Compute efficiency (e.g., identifying kernel hotspots, operator fusion, attention optimization, and minimizing framework/runtime overhead).
  - Memory efficiency (e.g., activation checkpointing, sequence packing, and reducing fragmentation).
Design training systems rather than just tuning them
- Define and refine parallelism strategies including data, tensor, pipeline, sharding, and hybrid approaches.
- Improve execution efficiency through communication scheduling, graph capture, and runtime optimization.
- Contribute to the overall system architecture with innovative solutions.