Principal Machine Learning Engineer - Training Systems

Rhoda AI, Palo Alto
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

We are looking for candidates with a strong background in machine learning and systems engineering. The ideal candidate should possess:

  • Experience with large-scale machine learning systems and multimodal training.
  • Proficiency in performance optimization techniques across distributed computing environments.
  • Strong analytical skills for diagnosing and improving training performance.
  • Expertise in designing training systems and defining strategies for scaling.
  • Excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment.

About the job

At Rhoda AI, we are building a comprehensive foundation for the next generation of humanoid robots. Our work spans everything from high-performance, software-defined hardware to advanced foundation models and video world models that govern robot behavior. Our robots are engineered to be versatile, capable of navigating intricate, real-world environments and handling scenarios not encountered in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes experts from institutions such as Stanford, Berkeley, and Harvard. Our ambition is not merely to add features; we are building a new computing platform for physical tasks. We are backed by over $400 million in funding, which supports aggressive investment in research and development, hardware innovation, and scaled-up manufacturing.

Role Overview

We are in search of a Principal Machine Learning Systems Engineer to own the performance of our training systems from start to finish. You will define how our model training scales and improve efficiency, scalability, and accuracy across large multimodal training environments. This is a pivotal systems role, not an infrastructure support position. Your contributions will directly affect how efficiently we use compute, how well our models scale across thousands of GPUs, and how quickly research can iterate.

Your Responsibilities

  • Oversee training performance from start to finish

    • Analyze and enhance the performance of large-scale multimodal training encompassing vision, video, proprioception, actions, and language.
    • Build systematic performance attribution: break step time down into compute, communication, and input pipeline; produce scaling curves across cluster sizes; and identify key bottlenecks.
    • Drive quantifiable improvements across:
      • Distributed efficiency (e.g., communication and compute overlap, bucketization, topology-aware mapping, and parallelism strategies).
      • Compute efficiency (e.g., identifying kernel hotspots, operator fusion, attention optimization, and minimizing framework/runtime overhead).
      • Memory efficiency (e.g., activation checkpointing, sequence packing, and reducing fragmentation).
  • Design training systems rather than just tuning them

    • Define and refine parallelism strategies including data, tensor, pipeline, sharding, and hybrid approaches.
    • Enhance execution efficiency through communication scheduling, graph capture, execution optimization, and runtime enhancements.
    • Contribute to the overall system architecture with innovative solutions.

About Rhoda AI

Rhoda AI is at the forefront of creating a transformative computing platform for physical work through humanoid robots. With a commitment to cutting-edge research and collaboration with top-tier experts, we are redefining the capabilities of robots in real-world applications.
