
Machine Learning Infrastructure Engineer

On-site Full-time





Qualifications

- Solid foundation in software engineering principles and proven experience building ML training infrastructure or internal platforms.

- Practical experience with large-scale training in JAX (preferred) or PyTorch.

- Knowledge of distributed training, multi-host configurations, data loaders, and evaluation pipelines.

- Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).

- Strong debugging skills and the ability to optimize performance bottlenecks.
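In practice, the scheduler experience above often means writing job scripts like the following hypothetical SLURM batch file for a multi-node training run. This is a minimal sketch, not a Physical Intelligence artifact: the partition-free layout, node/GPU counts, and the `train.py`/`configs/base.yaml` names are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=train-run          # illustrative job name
#SBATCH --nodes=4                     # multi-host training across 4 nodes
#SBATCH --ntasks-per-node=1          # one launcher process per node
#SBATCH --gpus-per-node=8            # assumes 8 GPUs per node
#SBATCH --time=48:00:00
#SBATCH --output=logs/%x-%j.out      # per-job log file, named by job and ID

# srun starts one process per node; each process can discover its peers
# through the SLURM-provided environment (SLURM_NODEID, SLURM_NNODES, etc.).
srun python train.py --config configs/base.yaml
```

In a JAX multi-host setup, each launched process would typically call `jax.distributed.initialize()` early so that all hosts join a single logical device mesh.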

About the job

As a Machine Learning Infrastructure Engineer at Physical Intelligence, you will play a vital role in enhancing and optimizing our training systems and core model code. You will own critical infrastructure for large-scale training, including GPU/TPU compute management, job orchestration, and reusable, efficient JAX training pipelines. Working closely with researchers and model engineers, you will help turn ideas into experiments, and experiments into production training runs.

This position is hands-on and offers significant leverage at the intersection of machine learning, software engineering, and scalable infrastructure.

The Team

Our ML Infrastructure team is dedicated to supporting and accelerating Physical Intelligence's core modeling initiatives by building systems that ensure large-scale training is reliable, reproducible, and efficient. The team collaborates with research, data, and platform engineers to guarantee that models can seamlessly transition from prototype to production-grade training runs.

Key Responsibilities

- Manage training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and performance metrics/logging.

- Expand distributed training: Collaborate with researchers to efficiently scale JAX-based training across TPU and GPU clusters.

- Enhance performance: Profile and optimize memory usage, device utilization, throughput, and distributed synchronization to maximize efficiency.

- Facilitate rapid iteration: Develop abstractions for launching, monitoring, debugging, and reproducing experiments.

- Oversee compute resources: Ensure optimal allocation and utilization of cloud-based GPU/TPU compute resources while managing costs effectively.

- Collaborate with researchers: Translate research requirements into infrastructure capabilities and promote best practices for large-scale training.

- Contribute to core training code: Evolve the JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.
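To make the core-training-code responsibility concrete, here is a minimal, hedged sketch of the kind of reusable JAX training step this role would own. It runs on a single device; names, the toy linear model, and hyperparameters are illustrative, and a production pipeline would add sharding annotations (e.g. via `jax.sharding`) for multi-host runs.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    """Mean-squared error for a toy linear model (illustrative only)."""
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@jax.jit  # compiled once; a multi-host version would add sharding annotations
def train_step(params, x, y, lr=0.1):
    """One SGD step: compute loss and grads, then update every leaf in params."""
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss


# Synthetic data: y = x @ true_w + 3.0, so the step should recover these values.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 3))
true_w = jnp.array([1.0, -2.0, 0.5])
y = x @ true_w + 3.0

params = {"w": jnp.zeros(3), "b": jnp.zeros(())}
for _ in range(200):
    params, loss = train_step(params, x, y)
```

Keeping the step function pure and pytree-based is what lets the same code be re-jitted under new architectures or sharding layouts without structural changes.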

About Physical Intelligence

Physical Intelligence is at the forefront of advancing machine learning technologies to create scalable and efficient solutions. Our team is committed to fostering innovation and driving impactful research that transforms industries.
