
Machine Learning Infrastructure Engineer (TPU/JAX/Optimization)

On-site · Full-time





Qualifications

- Proficient software engineering skills with a background in building ML training infrastructure or internal platforms.
- Demonstrated experience with large-scale training using JAX (preferred) or PyTorch.
- Understanding of distributed training, multi-host setups, data loaders, and evaluation pipelines.
- Proven ability to manage training workloads on schedulers and cloud platforms such as SLURM, Kubernetes, GCP TPU/GKE, or AWS.
- Strong debugging skills to identify and optimize performance bottlenecks across the training stack.
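The multi-host setups mentioned above are visible through JAX's runtime API. As a minimal, hypothetical sanity check (run here on CPU, where a single process sees at least one device; on a TPU pod slice, every host runs the same program and reports its own slice of the topology):

```python
import jax

# Each process in a multi-host JAX job reports its index, the total
# process count, and its local vs. global device counts. On a plain
# CPU machine this is one process with at least one device.
print(
    f"process {jax.process_index()} of {jax.process_count()}, "
    f"local devices: {jax.local_device_count()}, "
    f"global devices: {jax.device_count()}"
)
```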

About the job

Join our team as a Machine Learning Infrastructure Engineer where you will play a pivotal role in enhancing and scaling our training systems and core model code. You will be responsible for managing critical infrastructure that supports large-scale training processes, including GPU/TPU compute management and job orchestration, while developing reusable and efficient JAX training pipelines. Collaborating closely with researchers and model engineers, you'll be instrumental in translating innovative ideas into practical experiments, and from there, into production training runs.

This hands-on position merges the realms of machine learning, software engineering, and scalable infrastructure to deliver impactful results.

The Team

Our ML Infrastructure team is dedicated to bolstering and accelerating core modeling efforts at Physical Intelligence by creating reliable, reproducible, and fast systems for large-scale training. We collaborate with research, data, and platform engineers to ensure seamless scaling from prototypes to production-grade training runs.

Your Responsibilities

- Infrastructure Ownership: Design, implement, and maintain systems for large-scale model training, focusing on scheduling, job management, checkpointing, and metrics/logging.

- Distributed Training Scaling: Collaborate with researchers to scale JAX-based training smoothly across TPU and GPU clusters.

- Performance Optimization: Profile and enhance memory utilization, device usage, throughput, and distributed synchronization.

- Rapid Iteration Enablement: Develop abstractions for launching, monitoring, debugging, and reproducing experiments efficiently.

- Compute Resource Management: Ensure effective allocation and use of cloud-based GPU/TPU resources while managing costs.

- Research Collaboration: Convert research requirements into infrastructure capabilities and advocate for best practices in large-scale training.

- Core Training Code Contribution: Evolve JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
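As an illustrative sketch of the kind of JAX training code described above (all names are hypothetical, not from Physical Intelligence's codebase), here is a minimal jitted training step for a toy linear model. In production this step would be sharded across TPU/GPU devices and wrapped with checkpointing and metrics logging:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Mean-squared error of a linear model (illustrative only).
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=0.1):
    # One SGD step: compute gradients and update every leaf of the
    # parameter pytree.
    grads = jax.grad(loss_fn)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((3,)), "b": jnp.zeros(())}
x = jnp.ones((8, 3))
y = jnp.ones((8,))
for _ in range(100):
    params = train_step(params, x, y)
print(float(loss_fn(params, x, y)))  # loss approaches 0 on this toy data
```

The same pattern scales to multi-host training by replacing the plain `jax.jit` with sharded compilation over a device mesh; the pytree-based update is unchanged.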

About Physical Intelligence

At Physical Intelligence, we are at the forefront of combining advanced technology with human-centric design to create intelligent systems that enhance everyday life. Our mission is to innovate and develop solutions that empower individuals and organizations through the power of machine learning and AI.
