
Machine Learning Infrastructure Engineer (TPU/JAX/Optimization)

On-site · Full-time





Qualifications

- Proficient software engineering skills with a background in building ML training infrastructure or internal platforms.
- Demonstrated experience with large-scale training using JAX (preferred) or PyTorch.
- Understanding of distributed training, multi-host setups, data loaders, and evaluation pipelines.
- Proven ability to manage training workloads on schedulers and cloud platforms such as SLURM, Kubernetes, GCP TPU/GKE, or AWS.
- Strong debugging skills to identify and optimize performance bottlenecks across the training stack.
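The multi-host setups mentioned above are visible through JAX's runtime API. As a minimal, hypothetical sanity check (run here on CPU, where a single process sees at least one device; on a TPU pod slice, every host runs the same program and reports its own slice of the topology):

```python
import jax

# Each process in a multi-host JAX job reports its index, the total
# process count, and its local vs. global device counts. On a plain
# CPU machine this is one process with at least one device.
print(
    f"process {jax.process_index()} of {jax.process_count()}, "
    f"local devices: {jax.local_device_count()}, "
    f"global devices: {jax.device_count()}"
)
```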

About the job

Join our team as a Machine Learning Infrastructure Engineer where you will play a pivotal role in enhancing and scaling our training systems and core model code. You will be responsible for managing critical infrastructure that supports large-scale training processes, including GPU/TPU compute management and job orchestration, while developing reusable and efficient JAX training pipelines. Collaborating closely with researchers and model engineers, you'll be instrumental in translating innovative ideas into practical experiments, and from there, into production training runs.

This hands-on position merges the realms of machine learning, software engineering, and scalable infrastructure to deliver impactful results.

The Team

Our ML Infrastructure team is dedicated to bolstering and accelerating core modeling efforts at Physical Intelligence by creating reliable, reproducible, and fast systems for large-scale training. We collaborate with research, data, and platform engineers to ensure seamless scaling from prototypes to production-grade training runs.

Your Responsibilities

- Infrastructure Ownership: Design, implement, and maintain systems for large-scale model training, focusing on scheduling, job management, checkpointing, and metrics/logging.

- Distributed Training Scaling: Collaborate with researchers to scale JAX-based training smoothly across TPU and GPU clusters.

- Performance Optimization: Profile and enhance memory utilization, device usage, throughput, and distributed synchronization.

- Rapid Iteration Enablement: Develop abstractions for launching, monitoring, debugging, and reproducing experiments efficiently.

- Compute Resource Management: Ensure effective allocation and use of cloud-based GPU/TPU resources while managing costs.

- Research Collaboration: Convert research requirements into infrastructure capabilities and advocate for best practices in large-scale training.

- Core Training Code Contribution: Evolve JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
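As an illustrative sketch of the kind of JAX training code described above (all names are hypothetical, not from Physical Intelligence's codebase), here is a minimal jitted training step for a toy linear model. In production this step would be sharded across TPU/GPU devices and wrapped with checkpointing and metrics logging:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Mean-squared error of a linear model (illustrative only).
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=0.1):
    # One SGD step: compute gradients and update every leaf of the
    # parameter pytree.
    grads = jax.grad(loss_fn)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((3,)), "b": jnp.zeros(())}
x = jnp.ones((8, 3))
y = jnp.ones((8,))
for _ in range(100):
    params = train_step(params, x, y)
print(float(loss_fn(params, x, y)))  # loss approaches 0 on this toy data
```

The same pattern scales to multi-host training by replacing the plain `jax.jit` with sharded compilation over a device mesh; the pytree-based update is unchanged.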

About Physical Intelligence

At Physical Intelligence, we are at the forefront of combining advanced technology with human-centric design to create intelligent systems that enhance everyday life. Our mission is to innovate and develop solutions that empower individuals and organizations through the power of machine learning and AI.
