
Machine Learning Infrastructure Engineer

On-site Full-time





Qualifications

- Solid foundation in software engineering principles and proven experience building ML training infrastructure or internal platforms.

- Practical experience with large-scale training in JAX (preferred) or PyTorch.

- Knowledge of distributed training, multi-host configurations, data loaders, and evaluation pipelines.

- Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).

- Strong debugging skills and the ability to optimize performance bottlenecks.
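In practice, the scheduler experience above often means writing job scripts like the following hypothetical SLURM batch file for a multi-node training run. This is a minimal sketch, not a Physical Intelligence artifact: the partition-free layout, node/GPU counts, and the `train.py`/`configs/base.yaml` names are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=train-run          # illustrative job name
#SBATCH --nodes=4                     # multi-host training across 4 nodes
#SBATCH --ntasks-per-node=1          # one launcher process per node
#SBATCH --gpus-per-node=8            # assumes 8 GPUs per node
#SBATCH --time=48:00:00
#SBATCH --output=logs/%x-%j.out      # per-job log file, named by job and ID

# srun starts one process per node; each process can discover its peers
# through the SLURM-provided environment (SLURM_NODEID, SLURM_NNODES, etc.).
srun python train.py --config configs/base.yaml
```

In a JAX multi-host setup, each launched process would typically call `jax.distributed.initialize()` early so that all hosts join a single logical device mesh.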

About the job

As a Machine Learning Infrastructure Engineer at Physical Intelligence, you will play a vital role in enhancing and optimizing our training systems and core model code. You will own critical infrastructure for large-scale training, including GPU/TPU compute management, job orchestration, and reusable, efficient JAX training pipelines. Working closely with researchers and model engineers, you will help turn ideas into experiments, and experiments into production training runs.

This position is hands-on and offers significant leverage at the intersection of machine learning, software engineering, and scalable infrastructure.

The Team

Our ML Infrastructure team is dedicated to supporting and accelerating Physical Intelligence's core modeling initiatives by building systems that ensure large-scale training is reliable, reproducible, and efficient. The team collaborates with research, data, and platform engineers to guarantee that models can seamlessly transition from prototype to production-grade training runs.

Key Responsibilities

- Manage training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and performance metrics/logging.

- Expand distributed training: Collaborate with researchers to efficiently scale JAX-based training across TPU and GPU clusters.

- Enhance performance: Profile and optimize memory usage, device utilization, throughput, and distributed synchronization to maximize efficiency.

- Facilitate rapid iteration: Develop abstractions for launching, monitoring, debugging, and reproducing experiments.

- Oversee compute resources: Ensure optimal allocation and utilization of cloud-based GPU/TPU compute resources while managing costs effectively.

- Collaborate with researchers: Translate research requirements into infrastructure capabilities and promote best practices for large-scale training.

- Contribute to core training code: Evolve the JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
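To make the core-training-code responsibility concrete, here is a minimal, hedged sketch of the kind of reusable JAX training step this role would own. It runs on a single device; names, the toy linear model, and hyperparameters are illustrative, and a production pipeline would add sharding annotations (e.g. via `jax.sharding`) for multi-host runs.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    """Mean-squared error for a toy linear model (illustrative only)."""
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@jax.jit  # compiled once; a multi-host version would add sharding annotations
def train_step(params, x, y, lr=0.1):
    """One SGD step: compute loss and grads, then update every leaf in params."""
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss


# Synthetic data: y = x @ true_w + 3.0, so the step should recover these values.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 3))
true_w = jnp.array([1.0, -2.0, 0.5])
y = x @ true_w + 3.0

params = {"w": jnp.zeros(3), "b": jnp.zeros(())}
for _ in range(200):
    params, loss = train_step(params, x, y)
```

Keeping the step function pure and pytree-based is what lets the same code be re-jitted under new architectures or sharding layouts without structural changes.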

About Physical Intelligence

Physical Intelligence is at the forefront of advancing machine learning technologies to create scalable and efficient solutions. Our team is committed to fostering innovation and driving impactful research that transforms industries.
