Principal Machine Learning Engineer - Training Platform

Rhoda AI, Palo Alto
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

Key Responsibilities:

Take ownership of the training job lifecycle. Design and implement systems for job launch and configuration, monitoring and state tracking, automatic retries, and failure recovery. Create clean, scalable interfaces for running distributed training, including CLI, SDK, and standardized launch templates across model families.

Develop robust checkpointing and recovery systems. Build reliable checkpointing systems that are efficient and flexible, supporting sharded and distributed models. Facilitate seamless resumption from failures, with partial recovery capabilities and consistent state across distributed jobs.

Enhance training reproducibility and debuggability. Create systems for experiment configuration and versioning, tracking training states, metrics, and lineage to ensure reproducible outcomes.

About the job

At Rhoda AI, we are pioneering the development of a comprehensive full-stack platform for the next generation of humanoid robots. Our innovative approach encompasses high-performance, software-defined hardware along with foundational and video world models that empower our robotic systems. Our robots are engineered as versatile generalists, adept at navigating intricate, real-world scenarios, including those not encountered during training. Collaborating with a distinguished research team from Stanford, Berkeley, Harvard, and other leading institutions, we operate at the forefront of large-scale learning, robotics, and systems engineering. With over $400M in funding, we are aggressively investing in research and development, hardware innovation, and scaling up manufacturing to bring our vision to life.

We are on the lookout for a Staff / Principal Machine Learning Engineer to take charge of our training platform. This pivotal system is essential for ensuring that large-scale training is reliable, reproducible, and straightforward to execute. You will play a crucial role in defining the lifecycle of training jobs, including their launch, tracking, recovery, and debugging across our clusters. Your contributions will enable researchers to innovate rapidly without infrastructure hindrances.

In this role, you will be at the heart of research efficiency: when a training job fails, your system will recover it automatically; when experiments become hard to reproduce, you will implement effective solutions; and when GPU hours are wasted, you will make the waste visible and put preventive measures in place.

About Rhoda AI

Rhoda AI is at the cutting edge of robotics, focused on developing a multifaceted platform that empowers the next generation of humanoid robots. With significant investment in R&D and a world-class research team, we are committed to revolutionizing the field of robotics and machine learning.
