Staff Technical Lead for Inference & ML Performance jobs in San Francisco – Browse 2,422 openings on RoboApply Jobs
Staff Technical Lead for Inference & ML Performance jobs in San Francisco
Open roles matching “Staff Technical Lead for Inference & ML Performance” with location signals for San Francisco. 2,422 active listings on RoboApply Jobs.
2,422 jobs found
Staff Technical Lead for Inference & ML Performance
Clicking Apply Now takes you to AutoApply, where you can tailor your resume and apply.
Experience Level
Manager
Qualifications
The ideal candidate will possess strong leadership skills and a passion for driving technological innovation. We seek individuals who are not only technically adept but also excel in strategic thinking and team collaboration.
About the job
Join fal as we revolutionize the generative-media infrastructure landscape. Our mission is to enhance model inference performance, enabling creative experiences on an unprecedented scale. We are seeking a Staff Technical Lead for Inference & ML Performance, an individual who possesses a unique blend of deep technical knowledge and strategic foresight. In this pivotal role, you will lead a talented team dedicated to building and optimizing cutting-edge inference systems. If you're ready to influence the future of inference performance in a fast-paced and rapidly growing environment, we want to hear from you.
Why This Role Matters
In this role, you will play a crucial part in shaping the future of fal’s inference engine, ensuring that our generative models consistently deliver outstanding performance. Your contributions will directly affect our capacity to swiftly provide innovative creative solutions to a diverse clientele, from individual creators to global brands.
Your Responsibilities
Define and steer the technical direction, guiding your team across various domains including kernels, applied performance, ML compilers, and distributed inference to develop high-performance solutions.
About fal
fal is at the forefront of generative-media infrastructure, dedicated to pushing the limits of model inference performance. By leveraging advanced technologies, we empower creators and brands to deliver seamless and impactful creative experiences.
Full-time | $200K/yr - $400K/yr | Remote | San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.

Role Overview
We are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.
Join the Sora Team at OpenAI
The Sora team is at the forefront of developing multimodal capabilities within OpenAI’s foundational models. We are a dynamic blend of research and product development, committed to integrating sophisticated multimodal functionalities into our AI offerings. Our focus is on delivering solutions that are not only reliable and intuitive but also resonate with our mission to foster broad societal benefits.

Your Role as Inference Technical Lead
We are seeking a talented GPU Inference Engineer to enhance the model serving efficiency for Sora. This pivotal position will empower you to spearhead initiatives aimed at optimizing inference performance and scalability. You will collaborate closely with our researchers to design and develop models that are optimized for inference, directly contributing to the success of our projects. Your contributions will be vital in advancing the team’s overarching objectives, allowing leadership to concentrate on high-impact initiatives by establishing a robust technical foundation.

Key Responsibilities:
Enhance model serving, inference performance, and overall system efficiency through focused engineering efforts.
Implement optimizations targeting kernel and data movement to boost system throughput and reliability.
Collaborate with research and product teams to ensure our models operate effectively at scale.
Design, construct, and refine essential serving infrastructure to meet Sora’s growth and reliability demands.

You Will Excel in This Role If You:
Possess deep knowledge in model performance optimization, particularly at the inference level.
Have a strong foundation in kernel-level systems, data movement, and low-level performance tuning.
Are passionate about scaling high-performing AI systems that address real-world, multimodal challenges.
Thrive in ambiguous situations, setting technical direction, and driving complex projects to fruition.

This role is based in San Francisco, CA. We follow a hybrid work model requiring 3 in-office days per week and offer relocation assistance to new hires.
At Gimlet Labs, we are pioneering the development of the first heterogeneous neocloud designed specifically for AI workloads. As the demand for AI systems surges, traditional homogeneous infrastructures face critical limits in power, capacity, and cost. Our innovative platform effectively decouples AI workloads from their hardware foundations, intelligently partitioning tasks and orchestrating them to the most suitable hardware for optimal performance and efficiency. This strategy fosters heterogeneous systems that span multiple vendors and generations, including cutting-edge accelerators, enabling significant enhancements in performance and cost-effectiveness at scale.

In addition to this foundational work, Gimlet is establishing a robust neocloud for agentic workloads. Our clients benefit from deploying and managing their workloads via stable, production-ready APIs, without the need to navigate hardware selection or performance optimization intricacies.

We collaborate with foundation labs, hyperscalers, and AI-native companies to drive real production workloads capable of scaling to gigawatt-class AI datacenters.

We are currently seeking a Member of Technical Staff specializing in ML systems and inference. In this pivotal role, you will be responsible for designing and constructing inference systems that facilitate the execution of complete models in real production environments. You will operate at the intersection of model architecture and system performance to ensure that inference processes are swift, predictable, and scalable.

This position is perfect for engineers with a deep understanding of modern model execution and a passion for optimizing latency, throughput, and memory utilization across the entire inference lifecycle.
At Magic, we are driven by our mission to develop safe Artificial General Intelligence (AGI) that propels humanity forward in addressing the most critical challenges. We firmly believe that the future of safe AGI lies in automating research and code generation, allowing us to enhance models and tackle alignment issues more effectively than humans alone can manage. Our innovative approach combines cutting-edge pre-training, domain-specific reinforcement learning (RL), ultra-long context, and efficient inference-time computation to realize this vision.

Position Overview
As a Software Engineer within the Inference & RL Systems team, you will play a pivotal role in designing and managing the distributed systems that enable our models to function seamlessly in production, supporting extensive post-training workflows. This position operates at the intersection of model execution and distributed infrastructure, focusing on systems that influence inference latency, throughput, stability, and the reliability of RL and post-training loops.

Our long-context models impose significant execution demands, including KV-cache scaling, managing memory constraints for lengthy sequences, batching strategies, long-horizon trajectory rollouts, and ensuring consistent throughput under real-world workloads. You will be responsible for the infrastructure that ensures both production inference and large-scale RL iterations are efficient and dependable.

Key Responsibilities
Craft and scale high-performance inference serving systems.
Optimize KV-cache management, batching methods, and scheduling processes.
Enhance throughput and latency for long-context tasks.
Develop and sustain distributed RL and post-training infrastructure.
Boost reliability across rollout, evaluation, and reward pipelines.
Automate fault detection and recovery mechanisms for serving and RL systems.
Analyze and eliminate performance bottlenecks across GPU, networking, and storage components.
Collaborate with Kernel and Research teams to ensure alignment between execution systems and model architecture.

Qualifications
Solid foundation in software engineering and distributed systems.
Proven experience in building or managing large-scale inference or training systems.
In-depth understanding of GPU execution constraints and memory trade-offs.
Experience troubleshooting performance issues in production machine learning systems.
Capability to analyze system-level trade-offs between latency, throughput, and cost.
About Our Team
Join the Future of Computing Research team at OpenAI, an innovative applied research group within the Consumer Devices division. Our mission is to pioneer new methods and models that contribute to our overarching goal of developing Artificial General Intelligence (AGI) for the betterment of humanity.

Role Overview
As the Inference Technical Lead, you will collaborate with world-class machine learning researchers and top-notch design talents to push the boundaries of model capabilities. This position is based in San Francisco, CA, with a hybrid work model that includes 4 days in the office, along with relocation assistance for new hires.

Key Responsibilities
Assess and select silicon platforms, including GPUs, NPUs, and specialized accelerators, for the deployment of OpenAI models on-device and at the edge.
Collaborate closely with research teams to co-design model architectures that satisfy real-world constraints such as latency, memory, power, and bandwidth.
Conduct system performance analyses to identify trade-offs in model design, memory hierarchy, compute throughput, and hardware capabilities.
Work hand-in-hand with hardware vendors and internal infrastructure teams to launch new accelerators, ensuring efficient execution of transformer workloads.
Lead a team of engineers in implementing the low-level inference stack, encompassing kernel development and runtime systems.
Navigate challenges to transform emerging research capabilities into scalable solutions.

Ideal Candidate Profile
Proven experience in evaluating or deploying workloads on GPUs, NPUs, or other specialized accelerators.
Strong understanding of transformer model performance characteristics, including attention mechanisms, KV-cache behaviors, and memory bandwidth requirements.
Experience designing or optimizing high-performance computing systems, such as inference engines, distributed runtimes, or hardware-aware ML pipelines.
Background in building or leading teams focused on low-level performance-critical software, including CUDA kernels, compilers, or ML runtimes.
Demonstrated ability to thrive in a fast-paced, innovative environment.
About Our Team
At OpenAI, our Foundations team is dedicated to examining how model behavior evolves as we scale up models, data, and computing resources. We meticulously analyze the relationships between model architecture, optimization strategies, and training datasets to inform the design and training of next-generation models.

About the Position
As a Team Lead in Research Inference, you will be instrumental in constructing systems that empower advanced AI models to operate efficiently at scale. Your role lies at the crossroads of model research and systems engineering, where you will translate innovative architectural concepts into high-performance inference systems, clearly illustrating the trade-offs in performance, memory usage, and scalability. Your contributions will significantly shape model design, evaluation, and iteration processes across our research organization. By developing and refining high-performance inference infrastructure, you will provide researchers with the tools necessary to explore new ideas while understanding their computational and systems implications.

This position does not involve serving products; instead, it supports research through a focus on performance, accuracy, and realism, ensuring that our AI research is firmly rooted in scalable solutions.

Responsibilities
Design and develop optimized inference runtimes for large-scale AI models, emphasizing efficiency, reliability, and scalability.
Take ownership of optimizing core execution processes, including model execution, memory management, batching, and scheduling.
Enhance and expand distributed inference across multiple GPUs, focusing on parallelism, communication patterns, and runtime coordination.
Implement and refine critical inference operators and kernels based on real-world workloads.
Collaborate closely with research teams to ensure accurate and efficient support for new model architectures within inference systems.
Identify and resolve performance bottlenecks through comprehensive profiling, benchmarking, and low-level debugging.
Contribute to the observability, correctness, and reliability of large-scale AI systems.

Ideal Candidate Profile
Experience in developing production-level inference systems, beyond just training and executing models.
Proficiency in GPU-centric performance engineering, including managing memory behavior and understanding latency/throughput trade-offs.
Strong analytical skills and familiarity with performance profiling tools.
Full-time | $190.9K/yr - $232.8K/yr | On-site | San Francisco, California
P-1285

About This Role
Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems.

What You Will Do
Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference.
Work closely with researchers to integrate new model architectures and features, such as sparsity, activation compression, and mixture-of-experts, into the engine.
Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations.
Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads.
Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning.
Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes and load balancing while minimizing communication overhead.
Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals.
Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects.

What We Look For
A BS/MS/PhD in Computer Science or a related discipline.
A solid software engineering background with 6+ years of experience in performance-critical systems.
A proven ability to own complex system components and influence architectural decisions from conception to execution.
A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations.
Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.).
A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning.
Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).
About Us
At Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI.

We are developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability.

If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.

The Role
We are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) before and after compiler optimization on cutting-edge GPU architectures.

This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of machine learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.
At Runway ML, we are revolutionizing the intersection of art and science through innovative AI technology. Our mission is to build sophisticated world models that transcend traditional artificial intelligence limitations. We believe that to tackle the most pressing challenges—such as robotics, disease, and scientific breakthroughs—we need systems that can learn from experiences just like humans do. By simulating these experiences, we can expedite progress in ways that were previously unimaginable.

Our diverse and driven team consists of creative thinkers who are passionate about pushing boundaries and achieving the extraordinary. If you share this ambition and are eager to contribute to our groundbreaking work, we invite you to join us.

About the Role
We are open to hiring remotely across North America. We also have offices in NYC, San Francisco, and Seattle.

We are on the lookout for a highly skilled and intellectually inquisitive Technical Accounting Manager to be our go-to authority on intricate accounting issues. This position offers significant visibility and is ideal for a professional adept at interpreting complex accounting guidelines, formulating sound conclusions, and translating technical insights into practical accounting practices.
Full-time | $200K/yr - $400K/yr | Remote | San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, significantly enhancing the speed and reducing the cost of AI inference. Our founders, the visionaries behind vLLM, have spent years bridging the gap between advanced models and cutting-edge hardware.

About the Role
We are seeking a skilled performance engineer dedicated to maximizing the computational efficiency of modern accelerators. In this role, you'll develop kernels and implement low-level optimizations that position vLLM as the fastest inference engine globally. Your contributions will be pivotal as your code will execute across a broad spectrum of hardware accelerators, from NVIDIA GPUs to the latest silicon innovations. You'll collaborate closely with hardware vendors to ensure we fully leverage the capabilities of each new generation of chips.
Full-time | On-site | San Francisco, CA | New York City, NY
Role overview
Anthropic seeks a Technical Program Manager to support the Cloud Inference team. This position centers on steering technical projects that influence the development of cloud inference solutions. The role is located in either San Francisco, CA or New York City, NY.

What you will do
Oversee complex initiatives that move Anthropic’s cloud inference technologies forward.
Collaborate with engineers and partner teams to ensure delivery of dependable solutions.
Organize and synchronize work across different functions to achieve project objectives and deadlines.
At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As the demand for AI systems grows, traditional infrastructure faces significant limitations in terms of power, capacity, and cost. Our innovative platform addresses these challenges by decoupling AI workloads from the hardware, intelligently partitioning tasks, and directing each component to the most suitable hardware for optimal performance and efficiency. This method allows for the creation of heterogeneous systems that span multiple vendors and generations of hardware, including the latest cutting-edge accelerators, achieving substantial improvements in performance and cost-effectiveness.

Building upon this robust foundation, Gimlet is developing a production-grade neocloud designed for agentic workloads. Our customers can effortlessly deploy and manage their workloads with stable, production-ready APIs, eliminating the complexities of hardware selection, placement, or low-level performance optimization.

We collaborate with foundational labs, hyperscalers, and AI-native companies to drive real production workloads capable of scaling to gigawatt-class AI data centers.

We are currently seeking a dedicated Member of Technical Staff specializing in kernels and GPU performance. In this role, you will work closely with accelerators and execution hardware to extract maximum performance from AI workloads across diverse and rapidly evolving platforms. You will analyze low-level execution behaviors, design and optimize kernels, and ensure consistent performance across both established and emerging hardware.

This position is perfect for engineers who thrive on deep performance analysis, enjoy exploring hardware trade-offs, and are passionate about transforming theoretical peak performance into tangible real-world outcomes.
Join the team at Mirendil as a Member of Technical Staff specializing in Machine Learning Systems. In this role, you will leverage your expertise to develop innovative solutions that enhance our ML frameworks and contribute to groundbreaking projects in the AI space. Collaborate with top talent in a dynamic environment that promotes creativity and technical excellence.
Full-time | $204K/yr - $247K/yr | On-site | San Francisco, CA - US
At Crusoe, we are on a mission to enhance the availability of energy and intelligence. We are developing the driving force behind a future where individuals can harness the power of AI without compromising on scale, speed, or sustainability. Join the AI revolution with sustainable technology at Crusoe. This is your chance to lead impactful innovations, contribute to meaningful projects, and collaborate with a team dedicated to pioneering responsible and transformative cloud infrastructure.

Role Overview:
As an integral member of the Crusoe Managed AI Services team, you will oversee the entire product lifecycle for our Managed Inference services. From conceptualization and strategic planning to execution and market introduction, you will be the driving force behind our inference service offerings. Your ability to translate market demands and technical details into succinct product specifications and narratives will be crucial in fostering business growth for Crusoe Cloud.

This position is a Staff-level individual contributor role that offers considerable autonomy and influence. You will act as a senior product owner for a pivotal segment of our platform, collaborating closely with engineering, infrastructure, and go-to-market teams to expand and enhance Crusoe’s inference capabilities as the organization evolves. This is a unique opportunity to shape and develop a foundational product area within a rapidly growing and innovative company.

Key Responsibilities:
Lead the complete product lifecycle for Crusoe’s Managed Inference services, encompassing roadmap creation, execution, and iterative improvements.
Convert customer feedback, market insights, and technical limitations into clear product requirements and prioritization strategies.
Collaborate effectively with Engineering, Infrastructure, and Platform teams to provide scalable and dependable inference services.
Influence product decisions regarding performance, reliability, cost-effectiveness, and user experience for developers.
Establish and monitor success metrics for inference services operating in production environments.
Work alongside go-to-market teams to facilitate product launches, brand positioning, and customer engagement.
Articulate product strategy and decisions clearly to cross-functional partners and leadership.
About Liquid AI
Born from the innovation of MIT CSAIL, Liquid AI is at the forefront of developing general-purpose AI systems that operate seamlessly across various deployment platforms, including data center accelerators and on-device hardware. Our solutions prioritize low latency, minimal memory consumption, privacy, and reliability. We collaborate with leading enterprises in sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek extraordinary talent to join our mission.

The Opportunity
Join our Edge Inference team, where we transform Liquid Foundation Models into highly optimized machine code for resource-limited devices such as smartphones, laptops, Raspberry Pis, and smartwatches. As key contributors to llama.cpp, we establish the infrastructure necessary for efficient on-device AI. You will collaborate closely with our technical lead to tackle complex challenges that demand a profound understanding of machine learning architectures and hardware constraints. This role offers high ownership, allowing your code to be deployed in production environments and directly influence model performance on real devices.

While San Francisco and Boston are preferred, we welcome applicants from other locations.
Join our dynamic team at Perplexity as an AI Inference Engineer, where you will be at the forefront of deploying cutting-edge machine learning models for real-time inference. Our tech stack includes Python, Rust, C++, PyTorch, Triton, CUDA, and Kubernetes, providing you with a chance to work on large-scale applications that make a real impact.

Key Responsibilities
Design and develop APIs for AI inference that cater to both internal and external stakeholders.
Conduct benchmarking and identify bottlenecks within our inference stack to enhance performance.
Ensure the reliability and observability of our systems while promptly addressing any outages.
Investigate innovative research and implement optimizations for LLM inference.
Cohere builds and deploys advanced AI models used by developers and enterprises. These models support applications like content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. The team’s work aims to make AI more accessible and practical for real-world use. Each person at Cohere plays a direct role in strengthening the models and increasing their value for clients. The company values practical outcomes and continuous improvement, focusing on delivering reliable technology to users. The team includes researchers, engineers, designers, and professionals from a wide range of backgrounds. Cohere believes that diverse perspectives help create better products. The company welcomes those interested in shaping the future of AI to join its mission.
Arine, a fast-growing healthcare technology and clinical services company based in San Francisco, is dedicated to ensuring individuals receive the safest and most effective treatments tailored to their unique healthcare needs. Our mission is to redefine healthcare excellence by addressing the challenges posed by medications that can often do more harm than good.

In the U.S., incorrect medications and dosing result in over $528 billion in waste and preventable harm annually. Our innovative software platform (SaaS) leverages cutting-edge data science, machine learning, AI, and deep clinical expertise to offer a patient-centered approach to medication management. We develop and implement personalized care plans at scale, benefiting both patients and their care teams.

Arine is committed to enhancing the lives of complex patients, who significantly impact healthcare costs and are often challenging to identify and assist. These patients face a myriad of issues, including complicated prescriptions across multiple medications and providers, chronic disease management challenges, and difficulties accessing care. Supported by leading healthcare investors and collaborating with top organizations, we provide actionable recommendations and enable clinical interventions that lead to measurable health improvements and cost savings for our customers.

Why Choose to Work at Arine?
Exceptional Team and Culture - Our shared mission inspires us to excel and fosters a culture of relentless passion and innovation, positioning us as market leaders in medication intelligence.
Making a Tangible Impact in Healthcare - We are saving lives and enabling individuals to achieve better health outcomes.
Full-time | $145.7K/yr - $300.1K/yr | Remote | San Francisco, CA, US; Remote, US
Join Pinterest:
At Pinterest, we empower millions globally to discover creative ideas, envision new possibilities, and curate lasting memories. Our mission is to inspire everyone to create a life they love, driven by the talented individuals behind our innovative platform. Embark on a career where you can fuel innovation for millions, transform your passions into growth opportunities, value diverse experiences, and enjoy the flexibility to thrive. Building a career you love? It’s absolutely achievable!

At Pinterest, AI is not just an enhancement; it’s a critical partner enhancing creativity and expanding our reach. We seek candidates eager to embrace this journey. Throughout our interview process, we prioritize your ability to articulate your thought processes, showcasing not only your knowledge but also your collaborative skills with AI.

The Team:
As a Technical Program Manager at Pinterest, you will take ownership with a proactive approach and technical acumen. The Platform Team oversees program governance in Infrastructure, Infra Finance, Data Engineering and Security, Compliance, and cloud budget management.

Your Responsibilities:
As a Staff Technical Program Manager with a focus on cross-engineering projects, you will lead strategic initiatives vital to enhancing Pinterest’s ML/AI Platform and foundational infrastructure.
Lead Strategic ML/AI Platform Programs: Champion and execute high-impact, cross-engineering initiatives essential for advancing Pinterest's ML/AI Platform, GenAI infrastructure, and Agent Platform, ensuring outcomes from initial concept through measurable execution.
AI-First Execution Mindset: Employ GenAI as the primary model for program execution, producing AI-assisted drafts of core program documents and modernizing high-effort workflows.
Join fal as we revolutionize the generative-media infrastructure landscape. Our mission is to enhance model inference performance, enabling creative experiences on an unprecedented scale. We are seeking a Staff Technical Lead for Inference & ML Performance, an individual who possesses a unique blend of deep technical knowledge and strategic foresight. In this pivotal role, you will lead a talented team dedicated to building and optimizing cutting-edge inference systems. If you're ready to influence the future of inference performance in a fast-paced and rapidly growing environment, we want to hear from you.Why This Role MattersIn this role, you will play a crucial part in shaping the future of fal’s inference engine, ensuring that our generative models consistently deliver outstanding performance. Your contributions will directly affect our capacity to swiftly provide innovative creative solutions to a diverse clientele, from individual creators to global brands.Your ResponsibilitiesDefine and steer the technical direction, guiding your team across various domains including kernels, applied performance, ML compilers, and distributed inference to develop high-performance solutions.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.Role OverviewWe are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.
Join the Sora Team at OpenAIThe Sora team is at the forefront of developing multimodal capabilities within OpenAI’s foundational models. We are a dynamic blend of research and product development, committed to integrating sophisticated multimodal functionalities into our AI offerings. Our focus is on delivering solutions that are not only reliable and intuitive but also resonate with our mission to foster broad societal benefits.Your Role as Inference Technical LeadWe are seeking a talented GPU Inference Engineer to enhance the model serving efficiency for Sora. This pivotal position will empower you to spearhead initiatives aimed at optimizing inference performance and scalability. You will collaborate closely with our researchers to design and develop models that are optimized for inference, directly contributing to the success of our projects.Your contributions will be vital in advancing the team’s overarching objectives, allowing leadership to concentrate on high-impact initiatives by establishing a robust technical foundation.Key Responsibilities:Enhance model serving, inference performance, and overall system efficiency through focused engineering efforts.Implement optimizations targeting kernel and data movement to boost system throughput and reliability.Collaborate with research and product teams to ensure our models operate effectively at scale.Design, construct, and refine essential serving infrastructure to meet Sora’s growth and reliability demands.You Will Excel in This Role If You:Possess deep knowledge in model performance optimization, particularly at the inference level.Have a strong foundation in kernel-level systems, data movement, and low-level performance tuning.Are passionate about scaling high-performing AI systems that address real-world, multimodal challenges.Thrive in ambiguous situations, setting technical direction, and driving complex projects to fruition.This role is based in San Francisco, CA. We follow a hybrid work model requiring 3 in-office days per week and offer relocation assistance to new hires.
At Gimlet Labs, we are pioneering the development of the first heterogeneous neocloud designed specifically for AI workloads. As the demand for AI systems surges, traditional homogeneous infrastructures face critical limits in power, capacity, and cost. Our innovative platform effectively decouples AI workloads from their hardware foundations, intelligently partitioning tasks and orchestrating them to the most suitable hardware for optimal performance and efficiency. This strategy fosters heterogeneous systems that span multiple vendors and generations, including cutting-edge accelerators, enabling significant enhancements in performance and cost-effectiveness at scale.In addition to this foundational work, Gimlet is establishing a robust neocloud for agentic workloads. Our clients benefit from deploying and managing their workloads via stable, production-ready APIs, without the need to navigate hardware selection or performance optimization intricacies.We collaborate with foundation labs, hyperscalers, and AI-native companies to drive real production workloads capable of scaling to gigawatt-class AI datacenters.We are currently seeking a Member of Technical Staff specializing in ML systems and inference. In this pivotal role, you will be responsible for designing and constructing inference systems that facilitate the execution of complete models in real production environments. You will operate at the intersection of model architecture and system performance to ensure that inference processes are swift, predictable, and scalable.This position is perfect for engineers with a deep understanding of modern model execution and a passion for optimizing latency, throughput, and memory utilization across the entire inference lifecycle.
At Magic, we are driven by our mission to develop safe Artificial General Intelligence (AGI) that propels humanity forward in addressing the most critical challenges. We firmly believe that the future of safe AGI lies in automating research and code generation, allowing us to enhance models and tackle alignment issues more effectively than humans alone can manage. Our innovative approach combines cutting-edge pre-training, domain-specific reinforcement learning (RL), ultra-long context, and efficient inference-time computation to realize this vision.Position OverviewAs a Software Engineer within the Inference & RL Systems team, you will play a pivotal role in designing and managing the distributed systems that enable our models to function seamlessly in production, supporting extensive post-training workflows.This position operates at the intersection of model execution and distributed infrastructure, focusing on systems that influence inference latency, throughput, stability, and the reliability of RL and post-training training loops.Our long-context models impose significant execution demands, including KV-cache scaling, managing memory constraints for lengthy sequences, batching strategies, long-horizon trajectory rollouts, and ensuring consistent throughput under real-world workloads. You will be responsible for the infrastructure that ensures both production inference and large-scale RL iterations are efficient and dependable.Key ResponsibilitiesCraft and scale high-performance inference serving systems.Optimize KV-cache management, batching methods, and scheduling processes.Enhance throughput and latency for long-context tasks.Develop and sustain distributed RL and post-training infrastructure.Boost reliability across rollout, evaluation, and reward pipelines.Automate fault detection and recovery mechanisms for serving and RL systems.Analyze and eliminate performance bottlenecks across GPU, networking, and storage components.Collaborate with Kernel and Research teams to ensure alignment between execution systems and model architecture.QualificationsSolid foundation in software engineering and distributed systems.Proven experience in building or managing large-scale inference or training systems.In-depth understanding of GPU execution constraints and memory trade-offs.Experience troubleshooting performance issues in production machine learning systems.Capability to analyze system-level trade-offs between latency, throughput, and cost.
About Our TeamJoin the Future of Computing Research team at OpenAI, an innovative applied research group within the Consumer Devices division. Our mission is to pioneer new methods and models that contribute to our overarching goal of developing Artificial General Intelligence (AGI) for the betterment of humanity.Role OverviewAs the Inference Technical Lead, you will collaborate with world-class machine learning researchers and top-notch design talents to push the boundaries of model capabilities. This position is stationed in San Francisco, CA, offering a hybrid work model that includes 4 days in the office, along with relocation assistance for new hires.Key ResponsibilitiesAssess and select silicon platforms, including GPUs, NPUs, and specialized accelerators, for the deployment of OpenAI models on-device and at the edge.Collaborate closely with research teams to co-design model architectures that satisfy real-world constraints such as latency, memory, power, and bandwidth.Conduct system performance analyses to identify trade-offs in model design, memory hierarchy, compute throughput, and hardware capabilities.Work hand-in-hand with hardware vendors and internal infrastructure teams to launch new accelerators, ensuring efficient execution of transformer workloads.Lead a team of engineers in implementing the low-level inference stack, encompassing kernel development and runtime systems.Navigate challenges to transform emerging research capabilities into scalable solutions.Ideal Candidate ProfileProven experience in evaluating or deploying workloads on GPUs, NPUs, or other specialized accelerators.Strong understanding of transformer model performance characteristics, including attention mechanisms, KV-cache behaviors, and memory bandwidth requirements.Experience designing or optimizing high-performance computing systems, such as inference engines, distributed runtimes, or hardware-aware ML pipelines.Background in building or leading teams focused on low-level performance-critical software, including CUDA kernels, compilers, or ML runtimes.Demonstrated ability to thrive in a fast-paced, innovative environment.
About Our TeamAt OpenAI, our Foundations team is dedicated to examining how model behavior evolves as we scale up models, data, and computing resources. We meticulously analyze the relationships between model architecture, optimization strategies, and training datasets to inform the design and training of next-generation models.About the PositionAs a Team Lead in Research Inference, you will be instrumental in constructing systems that empower advanced AI models to operate efficiently at scale. Your role lies at the crossroads of model research and systems engineering, where you will translate innovative architectural concepts into high-performance inference systems, clearly illustrating the trade-offs in performance, memory usage, and scalability.Your contributions will significantly shape model design, evaluation, and iteration processes across our research organization. By developing and refining high-performance inference infrastructures, you will provide researchers with the tools necessary to explore new ideas while understanding their computational and systems implications.This position does not involve serving products; instead, it supports research through a focus on performance, accuracy, and realism, ensuring that our AI research is firmly rooted in scalable solutions.ResponsibilitiesDesign and develop optimized inference runtimes for large-scale AI models, emphasizing efficiency, reliability, and scalability.Take ownership of optimizing core execution processes, including model execution, memory management, batching, and scheduling.Enhance and expand distributed inference across multiple GPUs, focusing on parallelism, communication patterns, and runtime coordination.Implement and refine critical inference operators and kernels based on real-world workloads.Collaborate closely with research teams to ensure accurate and efficient support for new model architectures within inference systems.Identify and resolve performance bottlenecks through comprehensive profiling, benchmarking, and low-level debugging.Contribute to the observability, correctness, and reliability of large-scale AI systems.Ideal Candidate ProfileExperience in developing production-level inference systems, beyond just training and executing models.Proficient in GPU-centric performance engineering, including managing memory behavior and understanding latency/throughput trade-offs.Strong analytical skills and familiarity with performance profiling tools.
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California
P-1285 About This Role Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems. What You Will Do Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference. Work closely with researchers to integrate new model architectures or features, such as sparsity, activation compression, and mixture-of-experts into the engine. Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators. Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations. Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads. Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning. Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes, load balancing, and minimizing communication overhead. Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals. Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects. What We Look For A BS/MS/PhD in Computer Science or a related discipline. A solid software engineering background with 6+ years of experience in performance-critical systems. A proven ability to own complex system components and influence architectural decisions from conception to execution. A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations. Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.). A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning. Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).
About UsAt Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI.We are in the process of developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability.If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.The RoleWe are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) prior to and following compiler optimization on cutting-edge GPU architectures.This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of Machine Learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.
At Runway ML, we are revolutionizing the intersection of art and science through innovative AI technology. Our mission is to build sophisticated world models that transcend traditional artificial intelligence limitations. We believe that to tackle the most pressing challenges—such as robotics, disease, and scientific breakthroughs—we need systems that can learn from experiences just like humans do. By simulating these experiences, we can expedite progress in ways that were previously unimaginable.Our diverse and driven team consists of creative thinkers who are passionate about pushing boundaries and achieving the extraordinary. If you share this ambition and are eager to contribute to our groundbreaking work, we invite you to join us.About the Role*We are open to hiring remotely across North America. We also have offices in NYC, San Francisco, and Seattle.We are on the lookout for a highly skilled and intellectually inquisitive Technical Accounting Manager to be our go-to authority on intricate accounting issues. This position offers significant visibility and is ideal for a professional adept at interpreting complex accounting guidelines, formulating sound conclusions, and translating technical insights into practical accounting practices.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, significantly enhancing the speed and reducing the cost of AI inference. Our founders, the visionaries behind vLLM, have spent years bridging the gap between advanced models and cutting-edge hardware.About the RoleWe are seeking a skilled performance engineer dedicated to maximizing the computational efficiency of modern accelerators. In this role, you'll develop kernels and implement low-level optimizations that position vLLM as the fastest inference engine globally. Your contributions will be pivotal as your code will execute across a broad spectrum of hardware accelerators, from NVIDIA GPUs to the latest silicon innovations. You'll collaborate closely with hardware vendors to ensure we fully leverage the capabilities of each new generation of chips.
Full-time|On-site|San Francisco, CA | New York City, NY
Role overview Anthropic seeks a Technical Program Manager to support the Cloud Inference team. This position centers on steering technical projects that influence the development of cloud inference solutions. The role is located in either San Francisco, CA or New York City, NY. What you will do Oversee complex initiatives that move Anthropic’s cloud inference technologies forward Collaborate with engineers and partner teams to ensure delivery of dependable solutions Organize and synchronize work across different functions to achieve project objectives and deadlines
At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As the demand for AI systems grows, traditional infrastructure faces significant limitations in terms of power, capacity, and cost. Our innovative platform addresses these challenges by decoupling AI workloads from the hardware, intelligently partitioning tasks, and directing each component to the most suitable hardware for optimal performance and efficiency. This method allows for the creation of heterogeneous systems that span multiple vendors and generations of hardware, including the latest cutting-edge accelerators, achieving substantial improvements in performance and cost-effectiveness.Building upon this robust foundation, Gimlet is developing a production-grade neocloud designed for agentic workloads. Our customers can effortlessly deploy and manage their workloads with stable, production-ready APIs, eliminating the complexities of hardware selection, placement, or low-level performance optimization.We collaborate with foundational labs, hyperscalers, and AI-native companies to drive real production workloads capable of scaling to gigawatt-class AI data centers.We are currently seeking a dedicated Member of Technical Staff specializing in kernels and GPU performance. In this role, you will work closely with accelerators and execution hardware to extract maximum performance from AI workloads across diverse and rapidly evolving platforms. You will analyze low-level execution behaviors, design and optimize kernels, and ensure consistent performance across both established and emerging hardware.This position is perfect for engineers who thrive on deep performance analysis, enjoy exploring hardware trade-offs, and are passionate about transforming theoretical peak performance into tangible real-world outcomes.
Join the team at Mirendil as a Member of Technical Staff specializing in Machine Learning Systems. In this role, you will leverage your expertise to develop innovative solutions that enhance our ML frameworks and contribute to groundbreaking projects in the AI space. Collaborate with top talent in a dynamic environment that promotes creativity and technical excellence.
Full-time|$204K/yr - $247K/yr|On-site|San Francisco, CA - US
At Crusoe, we are on a mission to enhance the availability of energy and intelligence. We are developing the driving force behind a future where individuals can harness the power of AI without compromising on scale, speed, or sustainability.Join the AI revolution with sustainable technology at Crusoe. This is your chance to lead impactful innovations, contribute to meaningful projects, and collaborate with a team dedicated to pioneering responsible and transformative cloud infrastructure.Role Overview:As an integral member of the Crusoe Managed AI Services team, you will oversee the entire product lifecycle for our Managed Inference services. From conceptualization and strategic planning to execution and market introduction, you will be the driving force behind our inference service offerings. Your ability to translate market demands and technical details into succinct product specifications and narratives will be crucial in fostering business growth for Crusoe Cloud.This position is a Staff-level individual contributor role that offers considerable autonomy and influence. You will act as a senior product owner for a pivotal segment of our platform, collaborating closely with engineering, infrastructure, and go-to-market teams to expand and enhance Crusoe’s inference capabilities as the organization evolves.This is a unique opportunity to shape and develop a foundational product area within a rapidly growing and innovative company.Key Responsibilities:Lead the complete product lifecycle for Crusoe’s Managed Inference services, encompassing roadmap creation, execution, and iterative improvements.Convert customer feedback, market insights, and technical limitations into clear product requirements and prioritization strategies.Collaborate effectively with Engineering, Infrastructure, and Platform teams to provide scalable and dependable inference services.Influence product decisions regarding performance, reliability, cost-effectiveness, and user experience for developers.Establish and monitor success metrics for inference services operating in production environments.Work alongside go-to-market teams to facilitate product launches, brand positioning, and customer engagement.Articulate product strategy and decisions clearly to cross-functional partners and leadership.
About Liquid AI
Born from the innovation of MIT CSAIL, Liquid AI is at the forefront of developing general-purpose AI systems that operate seamlessly across various deployment platforms, including data center accelerators and on-device hardware. Our solutions prioritize low latency, minimal memory consumption, privacy, and reliability. We collaborate with leading enterprises in sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek extraordinary talent to join our mission.

The Opportunity
Join our Edge Inference team, where we transform Liquid Foundation Models into highly optimized machine code for resource-limited devices such as smartphones, laptops, Raspberry Pis, and smartwatches. As key contributors to llama.cpp, we establish the infrastructure necessary for efficient on-device AI. You will collaborate closely with our technical lead to tackle complex challenges that demand a profound understanding of machine learning architectures and hardware constraints. This role offers high ownership, allowing your code to be deployed in production environments and to directly influence model performance on real devices.

While San Francisco and Boston are preferred, we welcome applicants from other locations.
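For a feel of what on-device inference with llama.cpp looks like from the caller's side, here is a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder, and the quantization level, context size, and thread count are assumptions you would tune per device.

```python
# Minimal llama.cpp inference sketch via the llama-cpp-python bindings.
# "model.Q4_K_M.gguf" is a placeholder for a 4-bit quantized GGUF file;
# n_ctx and n_threads are illustrative values tuned per target device.
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",
    n_ctx=2048,      # context window in tokens
    n_threads=4,     # match the device's performance cores
)

out = llm("Explain edge inference in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

On constrained hardware, the interesting engineering sits upstream of this call: choosing a quantization format, laying out weights for the target's SIMD units, and keeping the KV cache within the device's memory budget.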
Join our dynamic team at Perplexity as an AI Inference Engineer, where you will be at the forefront of deploying cutting-edge machine learning models for real-time inference. Our tech stack includes Python, Rust, C++, PyTorch, Triton, CUDA, and Kubernetes, providing you with a chance to work on large-scale applications that make a real impact.

Key Responsibilities
Design and develop APIs for AI inference that cater to both internal and external stakeholders.
Conduct benchmarking and identify bottlenecks within our inference stack to enhance performance (see the sketch below).
Ensure the reliability and observability of our systems while promptly addressing any outages.
Investigate innovative research and implement optimizations for LLM inference.
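Much of that benchmarking work comes down to measuring inference latency honestly on the GPU. Below is a minimal sketch using PyTorch CUDA events; the Linear model, tensor shapes, and warmup/iteration counts are placeholders for illustration, not Perplexity's actual stack.

```python
# Sketch of GPU inference latency measurement with CUDA events.
# The Linear model and tensor shapes are placeholders for illustration.
import torch

def benchmark_ms(model: torch.nn.Module, x: torch.Tensor,
                 warmup: int = 10, iters: int = 100) -> float:
    """Return mean forward-pass latency in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):        # warm up kernels and the allocator
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()       # wait for queued kernels to finish
    return start.elapsed_time(end) / iters  # elapsed_time reports ms

if __name__ == "__main__":
    model = torch.nn.Linear(4096, 4096).cuda().half()
    x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    print(f"{benchmark_ms(model, x):.3f} ms/iter")
```

Timing with CUDA events rather than wall-clock time keeps Python dispatch overhead and host-device synchronization stalls out of the kernel measurement.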
Cohere builds and deploys advanced AI models used by developers and enterprises. These models support applications like content generation, semantic search, retrieval-augmented generation (RAG), and intelligent agents. The team’s work aims to make AI more accessible and practical for real-world use. Each person at Cohere plays a direct role in strengthening the models and increasing their value for clients. The company values practical outcomes and continuous improvement, focusing on delivering reliable technology to users. The team includes researchers, engineers, designers, and professionals from a wide range of backgrounds. Cohere believes that diverse perspectives help create better products. The company welcomes those interested in shaping the future of AI to join its mission.
Arine, a fast-growing healthcare technology and clinical services company based in San Francisco, is dedicated to ensuring individuals receive the safest and most effective treatments tailored to their unique healthcare needs. Our mission is to redefine healthcare excellence by addressing the challenges posed by medications that can often do more harm than good.

In the U.S., incorrect medications and dosing result in over $528 billion in waste and preventable harm annually. Our software platform (SaaS) leverages cutting-edge data science, machine learning, AI, and deep clinical expertise to offer a patient-centered approach to medication management. We develop and implement personalized care plans at scale, benefiting both patients and their care teams.

Arine is committed to enhancing the lives of complex patients, who significantly impact healthcare costs and are often challenging to identify and assist. These patients face a myriad of issues, including complicated prescriptions across multiple medications and providers, chronic disease management challenges, and difficulties accessing care. Supported by leading healthcare investors and collaborating with top organizations, we provide actionable recommendations and enable clinical interventions that lead to measurable health improvements and cost savings for our customers.

Why Choose to Work at Arine?
Exceptional Team and Culture - Our shared mission inspires us to excel and fosters a culture of relentless passion and innovation, positioning us as market leaders in medication intelligence.
Making a Tangible Impact in Healthcare - We are saving lives and enabling individuals to achieve better health outcomes.
Full-time|$145.7K/yr - $300.1K/yr|Remote|San Francisco, CA, US; Remote, US
Join Pinterest:
At Pinterest, we empower millions globally to discover creative ideas, envision new possibilities, and curate lasting memories. Our mission is to inspire everyone to create a life they love, driven by the talented individuals behind our innovative platform.

Embark on a career where you can fuel innovation for millions, transform your passions into growth opportunities, value diverse experiences, and enjoy the flexibility to thrive. Building a career you love? It’s absolutely achievable!

At Pinterest, AI is not just an enhancement; it’s a critical partner enhancing creativity and expanding our reach. We seek candidates eager to embrace this journey. Throughout our interview process, we prioritize your ability to articulate your thought processes, showcasing not only your knowledge but also your collaborative skills with AI. Discover more about our AI interview philosophy and its role in our recruitment process here.

The Team:
As a Technical Program Manager at Pinterest, you will take ownership with a proactive approach and technical acumen. The Platform Team oversees program governance in Infrastructure, Infra Finance, Data Engineering and Security, Compliance, and cloud budget management.

Your Responsibilities:
As a Staff Technical Program Manager with a focus on cross-engineering projects, you will lead strategic initiatives vital to enhancing Pinterest’s ML/AI Platform and foundational infrastructure.
Lead Strategic ML/AI Platform Programs: Champion and execute high-impact, cross-engineering initiatives essential for advancing Pinterest's ML/AI Platform, GenAI infrastructure, and Agent Platform, ensuring outcomes from initial concept through measurable execution.
AI-First Execution Mindset: Employ GenAI as the primary model for program execution, producing AI-assisted drafts of core program documents and modernizing high-effort workflows.