About the job
At Rhoda AI, we are building a full-stack platform for the next generation of humanoid robots. Our approach combines high-performance, software-defined hardware with foundational and video world models that power our robotic systems. Our robots are engineered as generalists that can handle intricate, real-world scenarios, including those not encountered during training. Collaborating with a research team from Stanford, Berkeley, Harvard, and other leading institutions, we operate at the forefront of large-scale learning, robotics, and systems engineering. With over $400M in funding, we are investing aggressively in research and development, hardware innovation, and scaling up manufacturing.
We are looking for a Staff / Principal Machine Learning Engineer to own our training platform, the system that makes large-scale training reliable, reproducible, and straightforward to run. You will define the lifecycle of training jobs: how they are launched, tracked, recovered, and debugged across our clusters. Your work will let researchers iterate quickly without being blocked by infrastructure.
In this role, you will be at the center of research efficiency: when a training job fails, your system will recover it automatically; when experiments become hard to reproduce, you will find and fix the cause; and when GPU hours are wasted, you will make the waste visible and put preventative measures in place.

