About the job
Join our team as a Machine Learning Infrastructure Engineer and play a pivotal role in improving and scaling our training systems and core model code. You will manage the critical infrastructure behind large-scale training, including GPU/TPU compute management and job orchestration, and develop reusable, efficient JAX training pipelines. Working closely with researchers and model engineers, you will help turn new ideas into practical experiments and then into production training runs.
This hands-on position combines machine learning, software engineering, and scalable infrastructure.
The Team
Our ML Infrastructure team supports and accelerates core modeling efforts at Physical Intelligence by building reliable, reproducible, and fast systems for large-scale training. We work with research, data, and platform engineers so that prototypes scale smoothly to production-grade training runs.
Your Responsibilities
- Infrastructure Ownership: Design, implement, and maintain systems for large-scale model training, focusing on scheduling, job management, checkpointing, and metrics/logging.
- Distributed Training Scaling: Work with researchers to scale JAX-based training smoothly across TPU and GPU clusters.
- Performance Optimization: Profile and optimize memory usage, device utilization, throughput, and distributed synchronization.
- Rapid Iteration Enablement: Develop abstractions for launching, monitoring, debugging, and reproducing experiments efficiently.
- Compute Resource Management: Ensure effective allocation and use of cloud-based GPU/TPU resources while managing costs.
- Research Collaboration: Convert research requirements into infrastructure capabilities and advocate for best practices in large-scale training.
- Core Training Code Contribution: Evolve JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.

