Training Performance Engineer jobs in San Francisco – Browse 5,278 openings on RoboApply Jobs

Training Performance Engineer

OpenAI, San Francisco

Hybrid Full-time

Qualifications

Candidates should possess a strong background in computer science, software engineering, or a related field, with extensive experience in performance engineering within distributed systems. Proficiency in GPU architecture, parallel programming, and performance profiling tools is essential. Familiarity with machine learning concepts and distributed training frameworks will be highly advantageous. Strong analytical and problem-solving skills, alongside an ability to collaborate effectively with cross-functional teams, are crucial for success in this role.

About the job

About Our Team
The Training Runtime team is at the forefront of developing a cutting-edge distributed machine learning training runtime, enabling everything from pioneering research to large-scale model deployments. Our mission is to empower researchers while facilitating growth into frontier-scale operations. We are crafting a cohesive, modular runtime that adapts to researchers’ evolving needs as they progress along the scaling curve.

Our focus is anchored in three key areas: optimizing high-performance, asynchronous data movement that is aware of tensor and optimizer states; building robust, fault-tolerant training frameworks that incorporate comprehensive state management, resilient checkpointing, deterministic orchestration, and advanced observability; and managing distributed processes for enduring, job-specific, and user-defined workflows.

We aim to seamlessly integrate proven large-scale capabilities into a developer-friendly runtime, enabling teams to iterate rapidly and operate reliably across various scales. Our success is gauged by both the enhancement of training throughput (the speed of model training) and researcher throughput (the pace at which ideas transform into experiments and products).

About the Role
As a Training Performance Engineer, you will be instrumental in driving efficiency enhancements throughout our distributed training architecture. Your responsibilities will include analyzing extensive training runs, pinpointing utilization gaps, and engineering optimizations that maximize throughput and system uptime. This position merges a profound understanding of systems with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and implementing model sharding techniques for large-scale training.

Your efforts will ensure our clusters operate at peak performance, enabling OpenAI to develop larger and more sophisticated models within existing compute budgets.

This position is located in San Francisco, CA, utilizing a hybrid work model with three days in the office each week, and we offer relocation assistance for new hires.

Key Responsibilities:

Analyze end-to-end training runs to detect performance bottlenecks across computation, communication, and storage.
Enhance GPU utilization and throughput for large-scale distributed model training.
Collaborate with runtime and systems engineers to boost kernel efficiency, scheduling, and collective communication performance.
Implement model graph transformations to enhance overall throughput.
Develop tools for monitoring and visualizing metrics such as model FLOPs utilization (MFU), throughput, and uptime across clusters.

About OpenAI

OpenAI is a pioneering organization dedicated to advancing digital intelligence in a way that is safe and beneficial for humanity. We are committed to conducting cutting-edge research and developing innovative technologies that push the boundaries of what's possible in the field of artificial intelligence. Our diverse team of experts strives to create an inclusive environment where creativity and collaboration thrive, enabling us to tackle some of the most complex challenges in AI.

Similar jobs

Training Performance Engineer

OpenAI

Full-time|Hybrid|San Francisco

Oct 16, 2025

Performance Modeling Engineer II

OpenAI

Full-time|On-site|San Francisco

Role overview
The Performance Modeling Engineer II position at OpenAI centers on building and applying performance models to enhance the efficiency of advanced AI systems. Based in San Francisco, this role contributes to the reliability and speed of OpenAI’s technologies.

What you will do

Develop and implement performance models for AI systems
Collaborate with data scientists and engineers to refine performance metrics
Support the efficiency and rigorous standards of OpenAI’s technologies

Apr 20, 2026

Senior Systems Performance Engineer

Crusoe

Full-time|On-site|San Francisco, CA - US

Join Crusoe as a Senior Systems Performance Engineer, where you will play a crucial role in optimizing and enhancing our systems for superior performance. You will be responsible for diagnosing performance bottlenecks, implementing solutions, and ensuring that our infrastructure can scale efficiently. Work in a dynamic environment that encourages innovation and professional growth.

Mar 18, 2026

Senior ML Performance Engineer

Lemurian Labs

Full-time|On-site|SF Bay Area

About Us
At Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI.

We are in the process of developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability.

If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.

The Role
We are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) prior to and following compiler optimization on cutting-edge GPU architectures.

This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of Machine Learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.

Oct 31, 2025

Performance Modeling Engineer

OpenAI

Full-time|Remote|San Francisco

OpenAI is seeking a Performance Modeling Engineer based in San Francisco. This role centers on building and improving models that enhance the performance and efficiency of AI systems. The work directly supports the technical backbone of OpenAI’s products.

Key responsibilities

Develop and refine models aimed at optimizing the performance of AI systems.
Collaborate with engineers and data scientists to tackle technical challenges as they arise.
Contribute to projects that improve the efficiency of large-scale AI infrastructure.

Role overview
This position offers the chance to work on foundational technology that underpins OpenAI’s products. The focus is on practical improvements and close teamwork with technical colleagues to advance the capabilities and efficiency of AI at scale.

Apr 20, 2026

ChatGPT Performance Engineer

OpenAI

Full-time|On-site|San Francisco

Role Overview
OpenAI is hiring a ChatGPT Performance Engineer in San Francisco. This role focuses on improving the performance and efficiency of ChatGPT’s advanced AI models. The position works closely with cross-functional teams to identify and implement solutions that make ChatGPT faster and more reliable for users around the world.

What You Will Do

Optimize the speed, reliability, and scalability of ChatGPT’s platforms.
Collaborate with engineers and other teams to solve technical challenges.
Develop and refine systems to support a seamless user experience globally.

Impact
This work directly shapes the future of AI at OpenAI, helping deliver a dependable and efficient ChatGPT experience to millions of users.

Apr 15, 2026

Performance Engineer - Immediate Openings for Local Candidates

usm2

Contract|On-site|San Francisco

We are seeking a talented Performance Engineer to join our dynamic team at usm2. This is an exciting opportunity for local professionals who are passionate about optimizing system performance and enhancing user experience. As a Performance Engineer, you will play a crucial role in analyzing performance metrics, identifying bottlenecks, and implementing solutions to ensure our applications run smoothly and efficiently.

May 18, 2017

GPU Performance Engineer

Genmo

Full-time|On-site|San Francisco HQ

At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.

We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.

Key Responsibilities

Utilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.
Develop high-performance CUDA and Triton kernels to optimize essential model functions.
Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.
Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.
Collaborate closely with machine learning engineers to optimize model implementations.
Diagnose and resolve performance issues throughout the application and hardware stack.
Implement custom memory pooling and allocation strategies to enhance performance.
Promote performance optimization techniques and foster a culture of excellence across teams.

Jul 17, 2025

Senior Infrastructure and Performance Engineer

Nash

Full-time|On-site|San Francisco

Senior Infrastructure & Performance Engineer
As a Senior Infrastructure & Performance Engineer, you will take charge of enhancing the performance, reliability, and scalability of Nash's foundational infrastructure. Collaborating closely with the Engineering Leadership and both platform and product engineering teams, you will design and manage low-latency, mission-critical systems that facilitate real-time logistics for some of the world's largest retailers.

This is a key senior role focused on elastic capacity, high availability, cloud-native architectures, Postgres performance, and enterprise-grade CI/CD for multi-region deployments. You will define the technical roadmap, establish best practices, and implement systems that support the essential workflows of major retailers.

Key Responsibilities

Oversee infrastructure performance and reliability for Nash's production environments, ensuring low latency, high throughput, and consistent performance under load.
Design, develop, and enhance AWS infrastructure, utilizing managed services with a focus on ECS/Fargate.
Lead initiatives in Postgres performance engineering, including query optimization, indexing strategies, connection management, replication, cluster design, and failover.
Architect and maintain multi-region, highly available systems with robust resiliency and guaranteed disaster recovery.
Design and refine enterprise-grade CI/CD pipelines that enable safe, repeatable, and rapid deployments across environments and regions.
Establish observability standards (metrics, logs, tracing, SLOs) to proactively identify and resolve performance bottlenecks.
Collaborate with application engineers to inform system design choices that influence scalability, latency, and reliability.
Lead incident response efforts and postmortems, emphasizing root cause analysis, systemic improvements, and long-term resilience.
Set best practices for infrastructure and performance while mentoring engineers throughout the organization.

Qualifications

6+ years of experience in building and managing high-scale production infrastructure for mission-critical systems.
Proficiency with AWS, particularly with ECS/Fargate, and experience with cloud-native architecture.
Strong background in Postgres performance tuning and optimization.
Deep understanding of CI/CD practices and experience in multi-region deployments.
Exceptional analytical and problem-solving skills, with a proactive approach to performance management.

Jan 6, 2026

Senior Frontend Engineer - Performance Optimization

ClickUp

Full-time|On-site|United States of America

At ClickUp, we're not just developing software; we're shaping the future of work! In an era dominated by work sprawl, we identified a more efficient way. This led us to create the first truly integrated AI workspace, consolidating tasks, documents, chat, calendar, and enterprise search, all enhanced by context-driven AI. Our mission is to empower millions of teams to escape silos, reclaim their time, and reach unprecedented levels of productivity. At ClickUp, you'll have the chance to learn, innovate, and leverage AI in transformative ways that will not only influence our product but also the broader landscape of work itself. Join a daring, pioneering team that's challenging the limits of what's possible!

We are on the lookout for a technical leader in SaaS client performance who is passionate about enhancing the customer experience through top-tier performance solutions. As a Senior Performance Engineer, you will spearhead comprehensive strategies to optimize application speed, memory utilization, and reliability across our entire platform. You will be empowered to analyze, diagnose, and address performance bottlenecks wherever they arise (front-end, back-end, or infrastructure), ensuring ClickUp remains the fastest and most reliable productivity platform available.

The ideal candidate is a hands-on authority in browser and NodeJS performance, with a thorough understanding of how code influences rendering, memory management, and overall user experience. You excel in solving intricate challenges, collaborating across teams, and establishing new benchmarks for performance excellence. If you're driven to make a significant impact for millions of users, this is your chance to lead at scale.

Your Responsibilities:

Conduct root cause analysis on client performance issues and perform post-mortems.
Profile application code to identify inefficient algorithms, memory leaks, and other issues; propose and implement effective solutions.
Establish performance monitoring, alerting, and dashboards to proactively detect and resolve client performance challenges.
Examine client traffic patterns, load testing outcomes, and other metrics to set benchmarks and drive enhancements.
Champion performance best practices and set performance standards across the engineering organization.
Identify infrastructure upgrades (caching, CDNs, database optimization) to elevate the client experience.
Collaborate with development teams to incorporate performance as a core requirement in the development of new features.

Dec 22, 2025

Senior Software Engineer - Performance

Databricks

Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California

P-97 At Databricks, we are dedicated to empowering data teams to tackle some of the most challenging problems in the world. We achieve this by creating and managing a leading data and AI infrastructure platform that enables our clients to leverage deep data insights for business enhancement. Our commitment to pushing the limits of data and AI technology is matched by our focus on resilience, security, and scalability, which are essential for our customers' success on our platform. Databricks operates one of the largest-scale software platforms, comprising millions of virtual machines that generate terabytes of logs and process exabytes of data daily. Given our scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must adeptly protect our customers from these issues. As a Senior Performance Engineer, you will collaborate with various teams throughout the organization to assess product and feature performance, pinpoint performance bottlenecks, and partner with engineers to address performance and scalability challenges. This includes setting performance goals for different software releases, guiding teams in developing performance benchmarks, conducting competitive benchmark analyses for various Databricks products, and performing in-depth analyses to identify and resolve performance issues.

Jan 30, 2026

Engineering Manager - Model Performance

Baseten

Full-time|On-site|San Francisco

ABOUT BASETEN
At Baseten, we empower the most innovative AI companies (such as Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer) by providing a robust platform for mission-critical inference. Our unique combination of applied AI research, adaptable infrastructure, and cutting-edge developer tools allows companies at the forefront of AI to deploy state-of-the-art models seamlessly. Having recently secured a $300M Series E funding round from notable investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we are poised for rapid growth. Join us in creating the essential platform for engineers to launch AI products.

THE ROLE
Are you driven to push the boundaries of artificial intelligence while leading a team of talented engineers? We are seeking a Technical Lead Manager with a focus on machine learning performance and inference. This position is perfect for an individual with a strong engineering foundation who is eager to guide and mentor a team while remaining actively engaged in hands-on technology work. If you excel in a dynamic startup atmosphere and are excited to tackle both leadership and technical challenges, we invite you to apply.

EXAMPLE INITIATIVES
As a member of our Model Performance team, you will work on projects such as:

Baseten Embeddings Inference: The fastest embeddings solution available
The Baseten Inference Stack
Driving model performance optimization

RESPONSIBILITIES

Lead, mentor, and manage a team of engineers dedicated to developing and optimizing ML model inference and performance.
Oversee technical strategy and architectural decisions, fostering improvements across our engineering organization.
Collaborate with cross-functional teams to ensure the seamless integration and scalability of ML models in production settings.
Drive innovation in model performance and advocate for best practices within the team.

Sep 12, 2024

Engineering Manager - Video Performance

Canva

Full-time|On-site|San Francisco

Join Canva as an Engineering Manager specializing in Video Performance, where you'll lead a talented team dedicated to enhancing our video features. You will play a pivotal role in driving innovation, optimizing video processing, and ensuring exceptional performance for our users.

This is an exciting opportunity for a motivated leader who thrives in a fast-paced environment and is passionate about delivering high-quality video experiences.

Mar 6, 2026

Workload Porting & Performance Engineer

OpenAI

Full-time|On-site|San Francisco

Role overview
OpenAI seeks a Workload Porting & Performance Engineer based in San Francisco. This position centers on optimizing advanced computing systems and ensuring workloads run efficiently on OpenAI’s technology stack. The engineer will help drive strong performance across multiple platforms.

What you will do

Port workloads to OpenAI’s computing systems
Analyze system performance and identify areas for improvement
Apply technical expertise to support the development of AI applications

Impact
This work shapes the performance of AI applications and advances the field. The results directly contribute to OpenAI’s mission to develop and deploy powerful AI technologies.

Apr 20, 2026

Senior Software Engineer - Network Performance & Reliability

Cloudflare, Inc.

Full-time|Hybrid|Hybrid

Join Cloudflare as a Senior Software Engineer specializing in Network Performance & Reliability! In this role, you'll be at the forefront of enhancing the performance and stability of our global network, ensuring our customers benefit from unparalleled speed and reliability. You'll collaborate with experts across various teams to design and implement innovative solutions that optimize network operations.

Mar 11, 2026

Performance Engineer - Member of Technical Staff, Kernel Engineering

Inferact

Full-time|$200K/yr - $400K/yr|Remote|San Francisco

At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, significantly enhancing the speed and reducing the cost of AI inference. Our founders, the visionaries behind vLLM, have spent years bridging the gap between advanced models and cutting-edge hardware.

About the Role
We are seeking a skilled performance engineer dedicated to maximizing the computational efficiency of modern accelerators. In this role, you'll develop kernels and implement low-level optimizations that position vLLM as the fastest inference engine globally. Your contributions will be pivotal as your code will execute across a broad spectrum of hardware accelerators, from NVIDIA GPUs to the latest silicon innovations. You'll collaborate closely with hardware vendors to ensure we fully leverage the capabilities of each new generation of chips.

Jan 22, 2026

Software Engineer - Model Performance

Baseten

Full-time|On-site|San Francisco

ABOUT BASETEN
Baseten is at the forefront of AI technology, empowering leading-edge companies like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer to seamlessly integrate advanced AI models into their operations. Our unique blend of applied AI research, adaptable infrastructure, and intuitive developer tools enables innovators to bring their most ambitious AI products to life. With our recent $300M Series E funding from top-tier investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are poised for rapid growth. Join us in shaping the platform that engineers rely on to deploy transformative AI solutions.

THE ROLE
Are you driven by a passion for enhancing artificial intelligence applications? We are seeking a proactive Software Engineer specializing in ML performance to join our energetic team. This position is perfect for backend engineers who thrive in a fast-paced startup environment and are eager to make substantial contributions to the realm of Large Language Model (LLM) Inference. If you're enthusiastic about optimizing open-source ML models, we can't wait to hear from you!

EXAMPLE INITIATIVES
As a member of our Model Performance team, you will have the opportunity to work on exciting projects, including:

Baseten Embeddings Inference: The quickest embeddings solution available
The Baseten Inference Stack
Driving model performance optimization

RESPONSIBILITIES

Develop, refine, and implement advanced techniques (quantization, speculative decoding, kv cache reuse, chunked prefill, and LoRA) for ML model inference and infrastructure.
Conduct thorough investigations into the codebases of TensorRT, PyTorch, TensorRT-LLM, vllm, sglang, CUDA, and other libraries to troubleshoot and resolve ML performance issues.
Scale and apply optimization techniques across a diverse array of ML models, with a focus on large language models.

Mar 28, 2024

Research Engineer in Performance Reinforcement Learning

Anthropic

Full-time|On-site|San Francisco, CA

Join the innovative team at Anthropic as a Research Engineer specializing in Performance Reinforcement Learning. In this role, you will contribute to cutting-edge research that directly influences the development of advanced AI systems. Collaborate with a talented group of engineers and researchers, leveraging your expertise to enhance our algorithms and improve overall performance.

Mar 23, 2026

Software Engineer, Inference - Performance Optimization

OpenAI

Full-time|On-site|San Francisco

Role overview
This Software Engineer position at OpenAI focuses on inference and performance optimization. Based in San Francisco, the role centers on increasing the speed and efficiency of advanced AI systems. Collaboration with experienced engineers is a key part of the work, with an emphasis on refining AI performance.

What you will do

Work on optimizing the performance of AI inference systems
Collaborate with other engineers to improve efficiency and speed
Contribute to solutions that enhance AI system capabilities

Location
This role is based in San Francisco.

Apr 25, 2026

Software Engineer - Productivity and Model Performance

OpenAI

Full-time|On-site|San Francisco

OpenAI is seeking a Software Engineer in San Francisco to focus on improving productivity by optimizing model performance. This position centers on developing solutions that make machine learning models more efficient and effective.

Role overview
This role involves working closely with teams across different functions to identify and address areas where model performance can be improved. The aim is to deliver changes that have a measurable impact on both systems and workflows.

What you will do

Collaborate with engineers and other specialists to enhance model efficiency
Develop and implement solutions that improve the effectiveness of machine learning systems
Contribute to projects that streamline processes and drive productivity gains

Impact
Your work will help shape improvements in how models operate and how teams at OpenAI achieve their goals. The changes you help deliver will support more effective use of resources and better outcomes for the organization.

Apr 29, 2026
