Physical Intelligence

Machine Learning Infrastructure Engineer - Supercomputing

On-site, Full-time


Qualifications

We are looking for candidates who have:

- A strong background in machine learning infrastructure, distributed systems, or related fields.

- Proficiency in scheduling algorithms and resource management strategies.

- Experience with GPU and TPU clusters, particularly in a multi-cloud environment.

- Knowledge of system optimization and monitoring tools.

- Exceptional problem-solving skills and a proactive mindset.

- A collaborative spirit to work effectively within multi-disciplinary teams.

About the job

At Physical Intelligence, we are pioneering general-purpose AI applications for the physical world. Our innovative approach involves orchestrating thousands of accelerators across a diverse ecosystem of GPU and TPU clusters, which encompass various hardware generations, cloud platforms, and cluster configurations.

Researchers frequently encounter challenges in identifying the optimal cluster for their tasks, understanding resource availability, and configuring their workloads efficiently. This process is not scalable. To enhance productivity, we require an intelligent scheduling and compute system that can automatically determine the best job placements based on availability, hardware compatibility, cost considerations, and priority levels, allowing researchers to concentrate on their scientific endeavors.
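As a rough illustration of the placement problem described above, the sketch below filters clusters by hardware compatibility and free capacity, picks the cheapest remaining candidate, and serves higher-priority jobs first. It is a minimal sketch, not Physical Intelligence's system: the `Cluster` and `Job` types, their fields, and the cheapest-first rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Cluster:
    # Hypothetical description of one cluster; all fields are illustrative.
    name: str
    accelerator: str            # made-up label such as "gpu" or "tpu"
    free_accelerators: int
    cost_per_accel_hour: float


@dataclass(frozen=True)
class Job:
    # Hypothetical job request; all fields are illustrative.
    name: str
    accelerators_needed: int
    compatible: FrozenSet[str]  # accelerator labels the job can run on
    priority: int               # larger means more urgent


def place(job: Job, clusters: List[Cluster]) -> Optional[Cluster]:
    """Return the cheapest compatible cluster with enough free capacity."""
    candidates = [
        c for c in clusters
        if c.accelerator in job.compatible
        and c.free_accelerators >= job.accelerators_needed
    ]
    if not candidates:
        return None
    # Tie-break on name so the choice is deterministic.
    return min(candidates, key=lambda c: (c.cost_per_accel_hour, c.name))


def schedule(jobs: List[Job], clusters: List[Cluster]) -> Dict[str, Optional[str]]:
    """Place jobs in descending priority order, updating free capacity."""
    pool = list(clusters)
    placements: Dict[str, Optional[str]] = {}
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        chosen = place(job, pool)
        placements[job.name] = chosen.name if chosen else None
        if chosen is not None:
            # Deduct the accelerators this job now occupies.
            pool = [
                replace(c, free_accelerators=c.free_accelerators - job.accelerators_needed)
                if c.name == chosen.name else c
                for c in pool
            ]
    return placements
```

A real scheduler would also weigh topology, queue wait time, and quotas; the greedy cheapest-first choice here only shows how availability, compatibility, cost, and priority can feed a single placement decision.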

This position owns this challenge end to end: the scheduling systems, placement logic, cluster management frameworks, and operational tools needed to keep the fleet running smoothly.

This role is distinct from traditional cloud DevOps; it focuses on resource allocation intelligence, utilization efficiency, fault tolerance, and ensuring a smooth experience for large-scale distributed training.

About the Team

The ML Infrastructure team is dedicated to bolstering and accelerating Physical Intelligence’s fundamental modeling initiatives by creating systems that ensure large-scale training is reliable, reproducible, and efficient. You will collaborate closely with the ML Infrastructure, data platform, and research teams to eliminate compute scheduling as a bottleneck.

Key Responsibilities

- Lead Intelligent Job Scheduling and Placement: Design and implement multi-tenant scheduling systems that automatically allocate training jobs to the most suitable cluster based on hardware specifications, topology, availability, cost, and priority. Facilitate equitable resource sharing across teams and projects through quota management, priority tiers, and preemption policies. Abstract away differences between clusters so researchers can submit jobs without needing detailed knowledge of each cluster.

- Enhance Multi-cluster Orchestration: Develop the control plane responsible for overseeing the job lifecycle across various clusters (including mixed GPU/TPU setups, multi-generational hardware, both on-premises and cloud-based) and enable effortless job migration, failover, and rescheduling.

- Optimize Accelerator Utilization and Performance: Continuously monitor and enhance GPU/TPU usage across the entire fleet. Apply priority, preemption, queuing, and fairness strategies that balance research momentum with cost efficiency.

- Guarantee Scalability and Stability: Implement fault detection, automatic recovery mechanisms, and resilience strategies for long-running multi-node training tasks. Oversee health checks, node management, and scaling strategies to ensure optimal performance.
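To make the priority and preemption responsibilities above more concrete, here is a minimal sketch of one possible preemption policy: when an incoming job does not fit, reclaim accelerators from strictly lower-priority running jobs, lowest priority first. The `RunningJob` type, the `select_preemptions` function, and the policy itself are assumptions for illustration, not a description of the team's actual design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunningJob:
    # Hypothetical record of a job currently holding accelerators.
    name: str
    accelerators: int
    priority: int  # larger means more urgent


def select_preemptions(
    running: List[RunningJob],
    free: int,
    needed: int,
    incoming_priority: int,
) -> Optional[List[str]]:
    """Return names of jobs to preempt so `needed` accelerators become free.

    Returns an empty list if the job already fits, and None if preempting
    every strictly lower-priority job would still not free enough capacity.
    """
    if needed <= free:
        return []
    victims: List[str] = []
    reclaimed = free
    # Lowest priority first; among equals, smaller jobs first.
    for job in sorted(running, key=lambda j: (j.priority, j.accelerators)):
        if job.priority >= incoming_priority:
            break  # never preempt equal or higher priority work
        victims.append(job.name)
        reclaimed += job.accelerators
        if reclaimed >= needed:
            return victims
    return None
```

In practice a policy like this would sit alongside checkpointing and automatic rescheduling so that preempted long-running jobs can resume rather than restart from scratch.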

About Physical Intelligence

Physical Intelligence is at the forefront of developing general-purpose AI for real-world applications, focusing on creating advanced solutions that harness the power of artificial intelligence to enhance physical processes and environments.
