Experience Level
Entry Level
Qualifications
The ideal candidate will possess a strong background in software engineering and artificial intelligence. Key qualifications include:
Proficiency in programming languages such as Python, Java, or C++
Experience with machine learning frameworks and libraries
Strong understanding of voice recognition technologies and natural language processing
Ability to work collaboratively in a fast-paced environment
Excellent problem-solving skills and attention to detail
About the job
Baseten seeks a Software Engineer to focus on Voice AI within the Inference Runtime team. This San Francisco-based role centers on building and refining AI models that power voice interaction features in Baseten products.
Role overview
This position involves hands-on work with AI models for voice-driven applications. The engineer will help shape how users interact with voice technology by developing and optimizing the underlying systems.
What you will do
Develop and optimize AI models for applications that use voice as a primary interface
Work on inference runtime systems to support responsive and intelligent voice experiences
Contribute directly to Baseten's products, influencing the future of voice technology for users
Location
This position is based in San Francisco.
About Baseten
Baseten is a forward-thinking technology company located in the heart of San Francisco. We are dedicated to pushing the boundaries of voice AI, creating innovative solutions that enhance user engagement and experience. Our vibrant team thrives on creativity and collaboration, and we are looking for passionate individuals to join us on this exciting journey.
About Us
At Physical Intelligence, we are pioneers in integrating general-purpose AI into the physical realm. Our team comprises dedicated engineers, scientists, roboticists, and innovators focused on creating advanced foundation models and learning algorithms that will empower the robots of today and the physically-actuated devices of tomorrow.
To achieve outstanding real-world performance, we prioritize ultra-low system latency, reliable sensor pipelines, and comprehensive engineering that ensures perception and control loops function seamlessly at real-time speeds.
As a Runtime Software Engineer, you will be at the forefront of developing low-latency, high-throughput systems that support our physical intelligence model. Your role will not involve designing ML models; instead, you will optimize the entire stack, from the operating system to the camera pipeline and networking, ensuring flawless production execution. You will work closely with researchers, platform engineers, and robotics operators to identify performance bottlenecks and maximize system efficiency.
The Team
Our Runtime team is crucial in creating the foundational platform that our robots, sensors, and evaluation systems depend on. This team excels in Linux systems engineering, camera and sensor integration, robot actuator control, networking, real-time input/output, and performance optimization tools. They ensure that our machine learning models and control systems function within strict latency constraints and are resilient under real-world conditions.
Your Responsibilities
Manage Real-Time Pipelines: Design and implement low-latency, high-reliability sensor and actuator pipelines utilizing Linux, drivers, and middleware.
Enhance System Performance: Analyze and optimize computational efficiency across I/O, memory, scheduling, networking, and storage to satisfy real-time requirements and boost throughput.
Develop OS-Level Features: Modify or expand Linux components, drivers, and scheduling mechanisms to ensure deterministic performance under load (see the sketch after this listing).
Streaming & Video Systems: Create and refine real-time video streaming systems with precise frame timing and packet scheduling.
Ensure Reliability & Debugging: Develop tools for profiling, tracing, and resolving timing challenges across distributed systems and hardware interfaces.
Collaborate Across Functions: Partner with researchers, hardware engineers, and operational teams to drive system performance and reliability.
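For readers unfamiliar with the deterministic-scheduling work this listing describes, here is a minimal Python sketch (an illustration, not Physical Intelligence's code; the core index and priority value are arbitrary assumptions) of pinning a control loop to one CPU core and switching it to the Linux SCHED_FIFO real-time class:

```python
import os

def make_realtime(core: int = 2, priority: int = 50) -> None:
    """Pin the current process to one core and give it a FIFO real-time priority.

    Linux-only; requires CAP_SYS_NICE (e.g. run as root). Values are illustrative.
    """
    os.sched_setaffinity(0, {core})                 # avoid cross-core migration jitter
    param = os.sched_param(priority)                # static real-time priority
    os.sched_setscheduler(0, os.SCHED_FIFO, param)  # preempt normal (SCHED_OTHER) tasks

if __name__ == "__main__":
    make_realtime()
    while True:
        # hypothetical control step: read sensors, compute, write actuators
        pass
```

A FIFO thread runs until it blocks or yields, so a loop like this must bound its own work per iteration or it will starve the rest of the core.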
Full-time|$300K/yr|On-site|San Francisco
ABOUT BASETEN
At Baseten, we empower the leading AI companies of today, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer, by providing essential inference capabilities. Our unique blend of applied AI research, adaptable infrastructure, and intuitive developer tools enables innovators at the cutting edge of AI to seamlessly transition advanced models into production. With our recent success in securing a $300M Series E funding round, backed by notable investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we're on an exciting growth trajectory. Join our team and contribute to the platform that engineers rely on to launch AI-driven products.
THE ROLE
As an Applied AI Inference Engineer at Baseten, you'll collaborate closely with clients to design, develop, and implement high-performance AI applications using our platform. You will guide customers through the entire process, from initial concept to deployment, transforming vague business objectives into dependable, observable solutions that meet defined quality, latency, and cost metrics.
This position is ideal for innovative engineers eager to gain insight into how modern organizations scale AI adoption. You will thrive if you enjoy a multifaceted role that intersects product development, software engineering, performance optimization, and direct customer engagement.
Note that this position requires hands-on coding and software development, while also encompassing elements of product management, technical customer success, and pre-sales engineering.
EXAMPLE INITIATIVES
Explore insights from our Forward Deployed Engineering team through these blog posts:
Forward Deployed Engineering on the frontier of AI
The fastest, most accurate Whisper transcription
Deploy production-ready model servers from Docker images
Deploy custom ComfyUI workflows as APIs...
Join Sysdig as a Senior Software Engineer specializing in Runtime. In this role, you'll be at the forefront of developing innovative solutions that enhance our runtime security and monitoring products.
You'll collaborate with a dynamic team, leveraging cutting-edge technologies to tackle complex challenges. We are looking for passionate engineers who thrive in a fast-paced environment and are eager to make a significant impact.
At Lemurian Labs, we're dedicated to democratizing the power of AI while ensuring a minimal environmental impact. Our focus is on the profound influence AI has on society and the planet, and we're committed to creating a robust foundation for its future, promoting sustainable and responsible AI growth. After all, innovation should benefit the world.
We are developing a state-of-the-art, high-performance compiler that allows developers to 'build once, deploy anywhere.' This means seamless cross-platform functionality: train your models in the cloud and deploy them at the edge, all while optimizing for resource efficiency and scalability.
If you're passionate about sustainably scaling AI and excited about making powerful AI development accessible, we invite you to join our team at Lemurian Labs. Be part of a fun and innovative environment where you can help shape the future without leaving a mess behind.
About Our Team
Join OpenAI's dynamic Inference team, where we empower the deployment of cutting-edge AI models, including our renowned GPT models, advanced Image Generation capabilities, and Whisper, across diverse platforms. Our mission is to ensure these models are not only high-performing and scalable but also available for real-world applications. Collaborating closely with our Research team, we're committed to bringing the next generation of AI innovations to fruition. As a compact, agile team, we prioritize delivering an exceptional developer experience while continuously pushing the frontiers of artificial intelligence.
As we expand our focus into multimodal inference, we are building the necessary infrastructure to support models that process images, audio, and other non-text modalities. This work involves tackling diverse model sizes and interactions, managing complex input/output formats, and ensuring seamless collaboration between product and research teams.
About The Role
We are seeking a passionate Software Engineer to aid in the large-scale deployment of OpenAI's multimodal models. You will join a small yet impactful team dedicated to creating robust, high-performance infrastructure for real-time audio, image, and various multimodal workloads in production environments.
This position is inherently collaborative; you will work directly with researchers who develop these models and with product teams to define novel interaction modalities. Your contributions will enable users to generate speech, interpret images, and engage with models in innovative ways that extend beyond traditional text-based interactions.
Key Responsibilities:
Design and implement advanced inference infrastructure for large-scale multimodal models.
Optimize systems for high-throughput and low-latency processing of image and audio inputs and outputs.
Facilitate the transition of experimental research workflows into dependable production services.
Engage closely with researchers, infrastructure teams, and product engineers to deploy state-of-the-art capabilities.
Contribute to systemic enhancements, including GPU utilization, tensor parallelism, and hardware abstraction layers.
You May Excel In This Role If You:
Have a proven track record of building and scaling inference systems for large language models or multimodal architectures.
Possess experience with GPU-based machine learning workloads and a solid understanding of the performance dynamics associated with large models, particularly with intricate data types like images or audio.
Thrive in a fast-paced, experimental environment and enjoy collaborating with cross-functional teams to drive impactful results.
About Our Team
Join the Inference team at OpenAI, where we leverage cutting-edge research and technology to deliver exceptional AI products to consumers, enterprises, and developers. Our mission is to empower users to harness the full potential of our advanced AI models, enabling unprecedented capabilities. We prioritize efficient and high-performance model inference while accelerating research advancements.
About the Role
We are seeking a passionate Software Engineer to optimize some of the world's largest and most sophisticated AI models for deployment in high-volume, low-latency, and highly available production and research environments.
Key Responsibilities
Collaborate with machine learning researchers, engineers, and product managers to transition our latest technologies into production.
Work closely with researchers to enable advanced research initiatives through innovative engineering solutions.
Implement new techniques, tools, and architectures that enhance the performance, latency, throughput, and effectiveness of our model inference stack.
Develop tools to identify bottlenecks and instability sources, designing and implementing solutions for priority issues (see the profiling sketch after this listing).
Optimize our code and Azure VM fleet to maximize every FLOP and GB of GPU RAM available.
You Will Excel in This Role If You:
Possess a solid understanding of modern machine learning architectures and an intuitive grasp of performance optimization strategies, especially for inference.
Take ownership of problems end-to-end, demonstrating a willingness to acquire any necessary knowledge to achieve results.
Bring at least 5 years of professional software engineering experience.
Have or can quickly develop expertise in PyTorch, NVIDIA GPUs, and relevant optimization software stacks (such as NCCL, CUDA), along with HPC technologies like InfiniBand, MPI, and NVLink.
Have experience in architecting, building, monitoring, and debugging production distributed systems, with bonus points for working on performance-critical systems.
Have successfully rebuilt or significantly refactored production systems multiple times to accommodate rapid scaling.
Are self-driven, enjoying the challenge of identifying and addressing the most critical problems.
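As a rough illustration of the bottleneck-hunting work described above (generic PyTorch tooling, not OpenAI's internal stack; the linear layer is a stand-in model), `torch.profiler` can rank operators by GPU time to show where inference cycles actually go:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and input; a real system would profile its serving path.
model = torch.nn.Linear(4096, 4096).cuda().eval()  # requires a CUDA device
x = torch.randn(8, 4096, device="cuda")

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
) as prof:
    for _ in range(10):
        model(x)

# Rank operators by accumulated GPU time to find hot spots worth optimizing.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

The same trace can be exported for a timeline viewer with `prof.export_chrome_trace(...)` when table summaries are not enough.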
Join Cloudflare as a Senior Software Engineer specializing in Workers Runtime. In this dynamic role, you will be instrumental in developing and enhancing our serverless platform, enabling developers worldwide to build and deploy applications seamlessly. You will leverage your expertise in programming languages, cloud technologies, and software architecture to craft innovative solutions that power the next generation of applications.
We are looking for a passionate engineer who thrives in a collaborative environment and is eager to tackle complex challenges. You will work closely with cross-functional teams to deliver high-quality software and contribute to Cloudflare's mission of building a better internet.
Baseten creates AI inference solutions for clients such as Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. The team blends AI research, infrastructure, and developer tools to help organizations deploy advanced models. Backed by $300M in Series E funding from BOND, IVP, Spark Capital, Greylock, and Conviction, Baseten is expanding quickly and shaping the landscape for engineers building AI products.
Role overview
The Software Engineer - Voice AI role centers on building and deploying open-source voice models for real-world use. Voice is becoming a key interface across the web, and this position addresses the technical challenges of bringing production-ready Voice AI to market. The work supports applications in productivity, customer service, clinical dialogue, creator tools, education, and more, helping to change how people interact with technology across sectors.
This engineer leads Baseten's Voice AI efforts, guiding the proprietary inference stack that powers Voice AI models. The role balances shaping the product roadmap with hands-on engineering. Collaboration is a core part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical teams to advance Voice AI capabilities.
Sample projects and initiatives
The world's fastest Whisper, with streaming and diarization
Canopy Labs selects Baseten for Orpheus TTS inference
Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent
Working with the Training Platform team to support ongoing training of voice models
Designing a developer-friendly API and SDK to encourage self-service adoption of Baseten Voice AI products
Location
San Francisco
Full-time|On-site|Mountain View, California; San Francisco, California
Join Databricks as a Senior Engineering Manager specializing in AI Runtime. In this pivotal role, you will lead a talented team in developing innovative solutions that drive the future of artificial intelligence. Your expertise will guide the architecture and implementation of AI runtime components, delivering scalable and efficient systems that empower our clients to harness the full potential of AI technologies.
Your responsibilities will include fostering a collaborative environment, mentoring team members, and ensuring the team's alignment with our strategic vision. You will collaborate closely with cross-functional teams to define and execute the roadmap for AI runtime, ensuring high performance and reliability.
Overview
At Pulse, we are revolutionizing the way data infrastructure operates by addressing the critical challenge of accurately extracting structured information from intricate documents on a large scale. Our innovative document understanding technique merges intelligent schema mapping with advanced extraction models, outperforming traditional OCR and parsing methods.
Located in the heart of San Francisco, we are a dynamic team of engineers dedicated to empowering Fortune 100 enterprises, YC startups, public investment firms, and growth-stage companies. Backed by top-tier investors, we are rapidly expanding our footprint in the industry.
What sets our technology apart is our sophisticated multi-stage architecture, which includes:
Specialized models for layout understanding and component detection
Low-latency OCR models designed for precise extraction
Advanced algorithms for reading-order determination in complex document structures
Proprietary methods for table structure recognition and parsing
Fine-tuned vision-language models for interpreting charts, tables, and figures
If you possess a strong passion for the convergence of computer vision, natural language processing, and data infrastructure, your contributions at Pulse will significantly impact our clients and help shape the future of document intelligence.
Join our innovative team at Anthropic as a Software Engineer specializing in Cloud Inference Safeguards. In this role, you will play a crucial part in developing and enhancing the systems that ensure the robustness and security of our cloud-based inference services. You will collaborate with cross-functional teams to design, implement, and maintain scalable solutions that meet our high standards for reliability and performance.
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California
P-1285
About This Role
Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems.
What You Will Do
Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference.
Work closely with researchers to integrate new model architectures or features, such as sparsity, activation compression, and mixture-of-experts, into the engine.
Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations.
Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads.
Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning.
Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes, load balancing, and minimal communication overhead.
Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals.
Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects.
What We Look For
A BS/MS/PhD in Computer Science or a related discipline.
A solid software engineering background with 6+ years of experience in performance-critical systems.
A proven ability to own complex system components and influence architectural decisions from conception to execution.
A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations.
Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.).
A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning.
Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).
Full-time|$165K/yr - $500K/yr|On-site|San Francisco, CA
Join the Fluidstack Team
At Fluidstack, we're pioneering the infrastructure for advanced intelligence. We collaborate with leading AI laboratories, governmental entities, and major corporations, including Mistral, Poolside, and Meta, to deliver computing solutions at unprecedented speeds.
Our mission is to transform the vision of Artificial General Intelligence (AGI) into a reality. Driven by our purpose, our dedicated team is committed to building state-of-the-art infrastructure that prioritizes our customers' success. If you share our passion for excellence and are eager to contribute to the future of intelligence, we invite you to be part of our journey.
Role Overview
The Inference Platform team at Fluidstack is at the forefront of addressing the cost and latency challenges associated with frontier AI. You will play a crucial role in managing the serving layer that connects our global accelerator supply with the production workloads of our clients, which includes LLM serving frameworks, KV cache infrastructure, and Kubernetes orchestration across multiple data centers.
This hands-on individual contributor role combines elements of distributed systems, model optimization, and serving infrastructure. You will oversee the entire lifecycle of inference deployments for leading AI labs, striving for enhancements in throughput, cost-efficiency, and response times, while also influencing the architectural decisions that guide Fluidstack's deployment strategies.
Baseten develops infrastructure and tools that help AI companies deploy and scale inference. Teams at organizations like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer rely on Baseten to bring advanced machine learning models into production. The company recently secured a $300M Series E from investors including BOND, IVP, Spark Capital, Greylock, and Conviction.
Role overview
This Software Engineer - GPU Inference position joins the founding team for Baseten Voice AI in San Francisco. The team focuses on building production-ready Voice AI systems, bringing open-source voice models into real-world use for clients in productivity, customer service, healthcare conversations, and education. The work shapes how people interact with technology through voice, creating broad impact across industries.
In this role, the engineer leads the internal inference stack that powers Voice AI models. Responsibilities include guiding the product roadmap and driving engineering execution. Collaboration is a key part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical groups to advance Voice AI capabilities.
Sample projects and initiatives
The world's fastest Whisper, with streaming and diarization
Canopy Labs selects Baseten for Orpheus TTS inference
Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent
Working with the Training Platform team to support continuous training of voice models
Designing a developer-friendly API and SDK for self-service adoption of Baseten Voice AI products
About Aqua Voice
Aqua Voice is pioneering the voice input landscape for the AI era. By training our own models and creating deep OS integrations, we ensure that voice technology is executed flawlessly across the entire stack.
The evolution of work is upon us; individual contributor roles are transforming as we now manage AI agents, a task that naturally aligns with voice interactions.
We are not focused on traditional conversational agents; instead, we champion a voice-in-text-out (VITO) approach, believing that voice functionality should operate above traditional applications, and that innovative startups can lead in this domain.
With an unwavering commitment to this vision, our progress has been promising, and we invite you to be a part of our journey.
The Role
As part of a growing team, you will have the opportunity to engage with all facets of our system.
Recent projects include:
Development of a real-time transcription server capable of managing thousands of concurrent audio streams (see the sketch after this listing)
Training and deploying custom speech recognition models
Creating native integrations for macOS and Windows utilizing advanced system APIs
Technology Stack
Frontend: TypeScript, React, Next.js, Electron
Backend: Python, real-time server (Bun/Node.js), WebSockets
Native: Swift (macOS), C# (Windows)
Machine Learning: Custom speech recognition models, inference pipelines
Infrastructure: Terraform, Stripe
Prior experience with all these technologies is not required.
Qualifications
- Showcase your previous work and projects.
- Adaptable in switching between programming languages and domains.
- Capable of taking ownership of projects from concept to production.
- Proficient in writing clean and maintainable code.
- Experience with production systems is essential.
- A positive attitude and a collaborative spirit are a must!
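As a rough sketch of what a real-time transcription server like the one described above might look like (illustrative only, not Aqua Voice's code; `transcribe_chunk` is a hypothetical stand-in for a real streaming speech model), a Python server using the `websockets` library can accept binary audio frames per connection and stream partial transcripts back, with asyncio multiplexing many concurrent streams on one event loop:

```python
import asyncio
import websockets  # pip install websockets (handler signature per v11+)

def transcribe_chunk(audio: bytes) -> str:
    """Hypothetical stand-in for a streaming speech-recognition model."""
    return f"[{len(audio)} bytes transcribed]"

async def handle_stream(ws):
    # One connection == one audio stream: binary frames in, text out.
    async for chunk in ws:
        if isinstance(chunk, bytes):
            await ws.send(transcribe_chunk(chunk))

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```

Scaling to thousands of streams is mostly about keeping `transcribe_chunk` off the event loop (for example, dispatching to a GPU-backed inference worker) so slow model calls never block other connections.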
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.
Role Overview
We are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.
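For context on what vLLM looks like from the user's side, here is its basic offline-inference pattern (a minimal example following the library's documented quickstart; the model name is just a small placeholder checkpoint):

```python
from vllm import LLM, SamplingParams  # pip install vllm

# vLLM batches these prompts internally (continuous batching + PagedAttention).
prompts = ["The capital of France is", "An inference engine is"]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # placeholder; any supported checkpoint works
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)
```

The runtime engineering described in this role lives below this API surface: schedulers, KV cache paging, and per-hardware kernels that keep `generate` fast as architectures change.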
About Retell AI
At Retell AI, we are pioneering the future of call centers through innovative voice AI technology. Our cutting-edge solutions are transforming how companies engage with customers.
In just 18 months since our inception, we've empowered thousands of businesses with our AI voice agents that efficiently manage sales, support, and logistics calls, significantly reducing the need for large teams of human agents. Supported by industry-leading investors including Y Combinator and Alt Capital, we've grown our annual recurring revenue from $5M to an impressive $36M while expanding our team from 5 to 20 talented individuals since 2025.
Our ambitious vision for 2026 is to develop a state-of-the-art customer experience platform where entire contact centers are driven by AI. Unlike basic automation requiring constant human oversight, we're engineering intelligent AI "workers" capable of serving in frontline agent, quality assurance analyst, and managerial roles, all while optimizing customer interactions continuously.
We are rapidly expanding and seeking driven builders who thrive on solving complex technical challenges, act decisively, and wish to make a tangible impact in one of the fastest-growing voice AI startups.
Join us in shaping the future!
Recognized as a top 50 AI application in the a16z list: https://tinyurl.com/5853dt2x
Ranked #4 in Brex's Fast-Growing Software Vendors of 2025: https://www.brex.com/journal/brex-benchmark-december-2025
Featured among the top startups on: https://leanaileaderboard.com/
About Our Team
The Inference team at OpenAI is dedicated to translating our cutting-edge research into accessible, transformative technology for consumers, enterprises, and developers. By leveraging our advanced AI models, we enable users to achieve unprecedented levels of innovation and productivity. Our primary focus lies in enhancing model inference efficiency and accelerating progress in research through optimized inference capabilities.
About the Role
We are seeking talented engineers to expand and optimize OpenAI's inference infrastructure, specifically targeting emerging GPU platforms. This role encompasses a wide range of responsibilities, from low-level kernel optimization to high-level distributed execution. You will collaborate closely with our research, infrastructure, and performance teams to ensure seamless operation of our largest models on cutting-edge hardware.
This position offers a unique opportunity to influence and advance OpenAI's multi-platform inference capabilities, with a strong emphasis on optimizing performance for AMD accelerators.
Your Responsibilities Include:
Overseeing the deployment, accuracy, and performance of the OpenAI inference stack on AMD hardware.
Integrating our internal model-serving infrastructure (e.g., vLLM, Triton) into diverse GPU-backed systems.
Debugging and optimizing distributed inference workloads across memory, network, and compute layers.
Validating the correctness, performance, and scalability of model execution on extensive GPU clusters.
Collaborating with partner teams to design and optimize high-performance GPU kernels for accelerators utilizing HIP, Triton, or other performance-centric frameworks (a toy Triton kernel follows this listing).
Working with partner teams to develop, integrate, and fine-tune collective communication libraries (e.g., RCCL) to parallelize model execution across multiple GPUs.
Ideal Candidates Will:
Possess experience in writing or porting GPU kernels using HIP, CUDA, or Triton, with a strong focus on low-level performance.
Be familiar with communication libraries like NCCL/RCCL, understanding their importance in high-throughput model serving.
Have experience with distributed inference systems and be adept at scaling models across multiple accelerators.
Enjoy tackling end-to-end performance challenges across hardware, system libraries, and orchestration layers.
Be eager to join a dynamic, agile team focused on building innovative infrastructure from the ground up.
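For a sense of the kernel-level work these bullets describe, below is the canonical Triton vector-add kernel from the library's tutorials (a toy example, not OpenAI code). Part of Triton's appeal for multi-platform inference is that the same Python-embedded kernel compiles to NVIDIA or AMD backends:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                     # one program instance per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                     # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                  # enough blocks to cover n elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")                 # "cuda" also maps to ROCm builds of PyTorch
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

Production kernels for attention or quantized matmuls are far more involved, but they follow the same structure: a grid of programs, masked loads and stores, and block sizes tuned per hardware target.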
Full-time|$142.2K/yr - $204.6K/yr|On-site|San Francisco, California
About This Role
Join Databricks as a Software Engineer focused on GenAI inference, where you will play a pivotal role in designing, developing, and enhancing the inference engine that drives our Foundation Model API. Collaborating at the intersection of research and production, you will ensure our large language model (LLM) serving systems are optimized for speed, scalability, and efficiency. Your contributions will span the entire GenAI inference stack, from kernels and runtimes to orchestration and memory management.
What You Will Do
Participate in the design and implementation of the inference engine, collaborating on a model-serving stack tailored for large-scale LLM inference.
Work closely with researchers to integrate new model architectures or features such as sparsity, activation compression, and mixture-of-experts into the engine.
Optimize latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
Build and maintain tools for instrumentation, profiling, and tracing to identify bottlenecks and inform optimization efforts.
Develop scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads (a batching sketch follows this listing).
Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including A/B launches, rollback, and model versioning.
Integrate with federated and distributed inference infrastructure, orchestrating across nodes, balancing load, and managing communication overhead.
Engage in cross-functional collaboration with platform engineers, cloud infrastructure, and security/compliance teams.
Document and share insights, contributing to internal best practices and open-source initiatives as appropriate.
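To make the batching-and-scheduling bullet concrete, here is a minimal dynamic-batching sketch (illustrative only, with a hypothetical `run_model` stand-in and arbitrary limits): requests arriving within a short window are grouped into one batched forward pass, trading a little latency for much higher throughput:

```python
import asyncio
from typing import List, Tuple

MAX_BATCH = 8        # assumed limits; production systems tune these carefully
MAX_WAIT_S = 0.01    # how long to wait for a batch to fill

def run_model(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for one batched model forward pass."""
    return [p.upper() for p in prompts]

async def batcher(queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()
    while True:
        items: List[Tuple[str, asyncio.Future]] = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:           # gather until full or the window closes
            try:
                timeout = max(deadline - loop.time(), 0.0)
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        for (_, fut), out in zip(items, run_model([p for p, _ in items])):
            fut.set_result(out)                 # wake each waiting request

async def handle(queue: asyncio.Queue, prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(handle(queue, f"req {i}") for i in range(20)))
    print(results[:3])

asyncio.run(main())
```

Real LLM serving engines refine this idea into continuous batching, admitting and retiring sequences every decoding step rather than waiting for whole requests to finish.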
Baseten seeks a Software Engineer to focus on Voice AI within the Inference Runtime team. This San Francisco-based role centers on building and refining AI models that power voice interaction features in Baseten products. Role overview This position involves hands-on work with AI models for voice-driven applications. The engineer will help shape how users interact with voice technology by developing and optimizing the underlying systems. What you will do Develop and optimize AI models for applications that use voice as a primary interface Work on inference runtime systems to support responsive and intelligent voice experiences Contribute directly to Baseten's products, influencing the future of voice technology for users Location This position is based in San Francisco.
About UsAt Physical Intelligence, we are pioneers in integrating general-purpose AI into the physical realm. Our team comprises dedicated engineers, scientists, roboticists, and innovators focused on creating advanced foundation models and learning algorithms that will empower the robots of today and the physically-actuated devices of tomorrow.To achieve outstanding real-world performance, we prioritize ultra-low system latency, reliable sensor pipelines, and comprehensive engineering that ensures perception and control loops function seamlessly at real-time speeds.As a Runtime Software Engineer, you will be at the forefront of developing low-latency, high-throughput systems that support our physical intelligence model. Your role will not involve designing ML models; instead, you will optimize the entire stack from the operating system to the camera pipeline and networking, ensuring flawless production execution. You will work closely with researchers, platform engineers, and robotics operators to identify performance bottlenecks and maximize system efficiency.The TeamOur Runtime team is crucial in creating the foundational platform that our robots, sensors, and evaluation systems depend on. This team excels in Linux systems engineering, camera and sensor integration, robot actuator control, networking, real-time input/output, and performance optimization tools. They ensure that our machine learning models and control systems function within strict latency constraints and are resilient under real-world conditions.Your ResponsibilitiesManage Real-Time Pipelines: Design and implement low-latency, high-reliability sensor and actuator pipelines utilizing Linux, drivers, and middleware.Enhance System Performance: Analyze and optimize computational efficiency across I/O, memory, scheduling, networking, and storage to satisfy real-time requirements and boost throughput.Develop OS-Level Features: Modify or expand Linux components, drivers, and scheduling mechanisms to ensure deterministic performance under load.Streaming & Video Systems: Create and refine real-time video streaming systems with precise frame timing and packet scheduling.Ensure Reliability & Debugging: Develop tools for profiling, tracing, and resolving timing challenges across distributed systems and hardware interfaces.Collaborate Across Functions: Partner with researchers, hardware engineers, and operational teams to drive system performance and reliability.
Full-time|$300K/yr - $300K/yr|On-site|San Francisco
ABOUT BASETENAt Baseten, we empower the leading AI companies of today, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer, by providing essential inference capabilities. Our unique blend of applied AI research, adaptable infrastructure, and intuitive developer tools enables innovators at the cutting edge of AI to seamlessly transition advanced models into production. With our recent success in securing a $300M Series E funding round, backed by notable investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we're on an exciting growth trajectory. Join our team and contribute to the platform that engineers rely on to launch AI-driven products.THE ROLEAs an Applied AI Inference Engineer at Baseten, you'll collaborate closely with clients to design, develop, and implement high-performance AI applications using our platform. You will guide customers through the entire process, from initial concept to deployment, transforming vague business objectives into dependable, observable solutions that meet defined quality, latency, and cost metrics.This position is ideal for innovative engineers eager to gain insight into how modern organizations scale AI adoption. You will thrive if you enjoy a multifaceted role that intersects product development, software engineering, performance optimization, and direct customer engagement.It’s essential to note that this position requires hands-on coding and software development, while also encompassing elements of product management, technical customer success, and pre-sales engineering.EXAMPLE INITIATIVESExplore insights from our Forward Deployed Engineering team through these blog posts: Forward Deployed Engineering on the frontier of AIThe fastest, most accurate Whisper transcriptionDeploy production-ready model servers from Docker imagesDeploy custom ComfyUI workflows as APIs...
Join Sysdig as a Senior Software Engineer specializing in Runtime. In this role, you'll be at the forefront of developing innovative solutions that enhance our runtime security and monitoring products.You'll collaborate with a dynamic team, leveraging cutting-edge technologies to tackle complex challenges. We are looking for passionate engineers who thrive in a fast-paced environment and are eager to make a significant impact.
At Lemurian Labs, we’re dedicated to democratizing the power of AI while ensuring a minimal environmental impact. Our focus is on the profound influence AI has on society and the planet, and we’re committed to creating a robust foundation for its future, promoting sustainable and responsible AI growth. After all, innovation should benefit the world.We are developing a state-of-the-art, high-performance compiler that allows developers to 'build once, deploy anywhere.' This means seamless cross-platform functionality—train your models in the cloud and deploy them at the edge, all while optimizing for resource efficiency and scalability.If you're passionate about sustainably scaling AI and excited about making powerful AI development accessible, we invite you to join our team at Lemurian Labs. Be part of a fun and innovative environment where you can help shape the future without leaving a mess behind.
About Our TeamJoin OpenAI’s dynamic Inference team, where we empower the deployment of cutting-edge AI models, including our renowned GPT models, advanced Image Generation capabilities, and Whisper, across diverse platforms. Our mission is to ensure these models are not only high-performing and scalable but also available for real-world applications. Collaborating closely with our Research team, we’re committed to bringing the next generation of AI innovations to fruition. As a compact, agile team, we prioritize delivering an exceptional developer experience while continuously pushing the frontiers of artificial intelligence.As we expand our focus into multimodal inference, we are building the necessary infrastructure to support models that process images, audio, and other non-text modalities. This work involves tackling diverse model sizes and interactions, managing complex input/output formats, and ensuring seamless collaboration between product and research teams.About The RoleWe are seeking a passionate Software Engineer to aid in the large-scale deployment of OpenAI’s multimodal models. You will join a small yet impactful team dedicated to creating robust, high-performance infrastructure for real-time audio, image, and various multimodal workloads in production environments.This position is inherently collaborative; you will work directly with researchers who develop these models and with product teams to define novel interaction modalities. Your contributions will enable users to generate speech, interpret images, and engage with models in innovative ways that extend beyond traditional text-based interactions.Key Responsibilities:Design and implement advanced inference infrastructure for large-scale multimodal models.Optimize systems for high-throughput and low-latency processing of image and audio inputs and outputs.Facilitate the transition of experimental research workflows into dependable production services.Engage closely with researchers, infrastructure teams, and product engineers to deploy state-of-the-art capabilities.Contribute to systemic enhancements, including GPU utilization, tensor parallelism, and hardware abstraction layers.You May Excel In This Role If You:Have a proven track record of building and scaling inference systems for large language models or multimodal architectures.Possess experience with GPU-based machine learning workloads and a solid understanding of the performance dynamics associated with large models, particularly with intricate data types like images or audio.Thrive in a fast-paced, experimental environment and enjoy collaborating with cross-functional teams to drive impactful results.
About Our TeamJoin the Inference team at OpenAI, where we leverage cutting-edge research and technology to deliver exceptional AI products to consumers, enterprises, and developers. Our mission is to empower users to harness the full potential of our advanced AI models, enabling unprecedented capabilities. We prioritize efficient and high-performance model inference while accelerating research advancements.About the RoleWe are seeking a passionate Software Engineer to optimize some of the world's largest and most sophisticated AI models for deployment in high-volume, low-latency, and highly available production and research environments.Key ResponsibilitiesCollaborate with machine learning researchers, engineers, and product managers to transition our latest technologies into production.Work closely with researchers to enable advanced research initiatives through innovative engineering solutions.Implement new techniques, tools, and architectures that enhance the performance, latency, throughput, and effectiveness of our model inference stack.Develop tools to identify bottlenecks and instability sources, designing and implementing solutions for priority issues.Optimize our code and Azure VM fleet to maximize every FLOP and GB of GPU RAM available.You Will Excel in This Role If You:Possess a solid understanding of modern machine learning architectures and an intuitive grasp of performance optimization strategies, especially for inference.Take ownership of problems end-to-end, demonstrating a willingness to acquire any necessary knowledge to achieve results.Bring at least 5 years of professional software engineering experience.Have or can quickly develop expertise in PyTorch, NVidia GPUs, and relevant optimization software stacks (such as NCCL, CUDA), along with HPC technologies like InfiniBand, MPI, and NVLink.Have experience in architecting, building, monitoring, and debugging production distributed systems, with bonus points for working on performance-critical systems.Have successfully rebuilt or significantly refactored production systems multiple times to accommodate rapid scaling.Are self-driven, enjoying the challenge of identifying and addressing the most critical problems.
Join Cloudflare as a Senior Software Engineer specializing in Workers Runtime. In this dynamic role, you will be instrumental in developing and enhancing our serverless platform, enabling developers worldwide to build and deploy applications seamlessly. You will leverage your expertise in programming languages, cloud technologies, and software architecture to craft innovative solutions that power the next generation of applications.We are looking for a passionate engineer who thrives in a collaborative environment and is eager to tackle complex challenges. You will work closely with cross-functional teams to deliver high-quality software and contribute to Cloudflare's mission of building a better internet.
Baseten creates AI inference solutions for clients such as Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. The team blends AI research, infrastructure, and developer tools to help organizations deploy advanced models. Backed by $300M in Series E funding from BOND, IVP, Spark Capital, Greylock, and Conviction, Baseten is expanding quickly and shaping the landscape for engineers building AI products. Role overview The Software Engineer - Voice AI role centers on building and deploying open-source voice models for real-world use. Voice is becoming a key interface across the web, and this position addresses the technical challenges of bringing production-ready Voice AI to market. The work supports applications in productivity, customer service, clinical dialogue, creator tools, education, and more, helping to change how people interact with technology across sectors. This engineer leads Baseten’s Voice AI efforts, guiding the proprietary inference stack that powers Voice AI models. The role balances shaping the product roadmap with hands-on engineering. Collaboration is a core part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical teams to advance Voice AI capabilities. Sample projects and initiatives The world's fastest Whisper, with streaming and diarization Canopy Labs selects Baseten for Orpheus TTS inference Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent Working with the Training Platform team to support ongoing training of voice models Designing a developer-friendly API and SDK to encourage self-service adoption of Baseten Voice AI products Location San Francisco
Full-time|On-site|Mountain View, California; San Francisco, California
Join Databricks as a Senior Engineering Manager specializing in AI Runtime. In this pivotal role, you will lead a talented team in developing innovative solutions that drive the future of artificial intelligence. Your expertise will guide the architecture and implementation of AI runtime components, delivering scalable and efficient systems that empower our clients to harness the full potential of AI technologies.Your responsibilities will include fostering a collaborative environment, mentoring team members, and ensuring the team's alignment with our strategic vision. You will collaborate closely with cross-functional teams to define and execute the roadmap for AI runtime, ensuring high performance and reliability.
OverviewAt Pulse, we are revolutionizing the way data infrastructure operates by addressing the critical challenge of accurately extracting structured information from intricate documents on a large scale. Our innovative document understanding technique merges intelligent schema mapping with advanced extraction models, outperforming traditional OCR and parsing methods.Located in the heart of San Francisco, we are a dynamic team of engineers dedicated to empowering Fortune 100 enterprises, YC startups, public investment firms, and growth-stage companies. Backed by top-tier investors, we are rapidly expanding our footprint in the industry.What sets our technology apart is our sophisticated multi-stage architecture, which includes:Specialized models for layout understanding and component detectionLow-latency OCR models designed for precise extractionAdvanced algorithms for reading-order in complex document structuresProprietary methods for table structure recognition and parsingFine-tuned vision-language models for interpreting charts, tables, and figuresIf you possess a strong passion for the convergence of computer vision, natural language processing, and data infrastructure, your contributions at Pulse will significantly impact our clients and help shape the future of document intelligence.
Join our innovative team at Anthropic as a Software Engineer specializing in Cloud Inference Safeguards. In this role, you will play a crucial part in developing and enhancing the systems that ensure the robustness and security of our cloud-based inference services. You will collaborate with cross-functional teams to design, implement, and maintain scalable solutions that meet our high standards for reliability and performance.
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California
P-1285 About This Role Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems. What You Will Do Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference. Work closely with researchers to integrate new model architectures or features, such as sparsity, activation compression, and mixture-of-experts into the engine. Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators. Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations. Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads. Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning. Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes, load balancing, and minimizing communication overhead. Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals. Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects. What We Look For A BS/MS/PhD in Computer Science or a related discipline. A solid software engineering background with 6+ years of experience in performance-critical systems. A proven ability to own complex system components and influence architectural decisions from conception to execution. A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations. Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.). A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning. Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).
Full-time|$165K/yr - $500K/yr|On-site|San Francisco, CA
Join the Fluidstack TeamAt Fluidstack, we’re pioneering the infrastructure for advanced intelligence. We collaborate with leading AI laboratories, governmental entities, and major corporations—including Mistral, Poolside, and Meta—to deliver computing solutions at unprecedented speeds.Our mission is to transform the vision of Artificial General Intelligence (AGI) into a reality. Driven by our purpose, our dedicated team is committed to building state-of-the-art infrastructure that prioritizes our customers' success. If you share our passion for excellence and are eager to contribute to the future of intelligence, we invite you to be part of our journey.Role OverviewThe Inference Platform team at Fluidstack is at the forefront of addressing the cost and latency challenges associated with frontier AI. You will play a crucial role in managing the serving layer that connects our global accelerator supply with the production workloads of our clients, which include LLM serving frameworks, KV cache infrastructure, and Kubernetes orchestration across multiple data centers.This hands-on individual contributor role combines elements of distributed systems, model optimization, and serving infrastructure. You will oversee the entire lifecycle of inference deployments for leading AI labs, striving for enhancements in throughput, cost-efficiency, and response times, while also influencing the architectural decisions that guide Fluidstack’s deployment strategies.
Baseten develops infrastructure and tools that help AI companies deploy and scale inference. Teams at organizations like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer rely on Baseten to bring advanced machine learning models into production. The company recently secured a $300M Series E from investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Role overview This Software Engineer - GPU Inference position joins the founding team for Baseten Voice AI in San Francisco. The team focuses on building production-ready Voice AI systems, bringing open-source voice models into real-world use for clients in productivity, customer service, healthcare conversations, and education. The work shapes how people interact with technology through voice, creating broad impact across industries. In this role, the engineer leads the internal inference stack that powers Voice AI models. Responsibilities include guiding the product roadmap and driving engineering execution. Collaboration is a key part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical groups to advance Voice AI capabilities. Sample projects and initiatives The world's fastest Whisper, with streaming and diarization Canopy Labs selects Baseten for Orpheus TTS inference Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent Working with the Training Platform team to support continuous training of voice models Designing a developer-friendly API and SDK for self-service adoption of Baseten Voice AI products
About Aqua VoiceAqua Voice is pioneering the voice input landscape for the AI era. By training our own models and creating deep OS integrations, we ensure that voice technology is executed flawlessly across the entire stack.The evolution of work is upon us; individual contributor roles are transforming as we now manage AI agents, a task that naturally aligns with voice interactions.We are not focused on traditional conversational agents; instead, we champion a voice-in-text-out (VITO) approach, believing that voice functionality should operate above traditional applications, and that innovative startups can lead in this domain.With an unwavering commitment to this vision, our progress has been promising, and we invite you to be a part of our journey.The RoleAs a growing team, you will have the opportunity to engage with all facets of our system.Recent projects include:Development of a real-time transcription server capable of managing thousands of concurrent audio streams.Training and deploying custom speech recognition models.Creating native integrations for macOS and Windows utilizing advanced system APIs.Technology StackFrontend: TypeScript, React, Next.js, ElectronBackend: Python, real-time server (Bun/Node.js), WebSocketsNative: Swift (macOS), C# (Windows)Machine Learning: Custom speech recognition models, inference pipelinesInfrastructure: Terraform, StripePrior experience with all these technologies is not required.Qualifications- Showcase your previous work and projects.- Adaptable in switching between programming languages and domains.- Capable of taking ownership of projects from concept to production.- Proficient in writing clean and maintainable code.- Experience with production systems is essential.- A positive attitude and a collaborative spirit are a must!
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, accelerating AI progress by making inference more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.
Role Overview
We are seeking a passionate inference runtime engineer eager to push the frontiers of LLM and diffusion model serving. As models grow in complexity, with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will improve model execution across a variety of hardware platforms and architectures. Your contributions will directly influence the future of AI inference.
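For readers unfamiliar with the engine, this is what vLLM's public offline-inference API looks like, a standard example along the lines of vLLM's own documentation; the model choice is an arbitrary small one, not specific to this role.

```python
# Minimal offline inference with vLLM's public Python API.
# Model name is an arbitrary small example, not specific to this role.
from vllm import LLM, SamplingParams

prompts = ["The key challenge in serving mixture-of-experts models is"]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM handles continuous batching and paged KV cache management internally.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```

The runtime work described in the posting lives beneath this API: scheduling, paged attention, and per-architecture execution paths that the two-line user experience depends on.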
About Retell AI
At Retell AI, we are pioneering the future of call centers through innovative voice AI technology. Our solutions are transforming how companies engage with customers.
In just 18 months since our inception, we've empowered thousands of businesses with AI voice agents that efficiently handle sales, support, and logistics calls, significantly reducing the need for large teams of human agents. Backed by industry-leading investors including Y Combinator and Alt Capital, we've grown our annual recurring revenue from $5M to $36M and expanded our team from 5 to 20 people since 2025.
Our ambitious vision for 2026 is a state-of-the-art customer experience platform where entire contact centers are driven by AI. Unlike basic automation that requires constant human oversight, we're engineering intelligent AI "workers" capable of serving as frontline agents, quality assurance analysts, and managers, all while continuously optimizing customer interactions.
We are rapidly expanding and seeking driven builders who thrive on complex technical challenges, act decisively, and want to make a tangible impact at one of the fastest-growing voice AI startups. Join us in shaping the future!
Recognized as a top 50 AI application on the a16z list: https://tinyurl.com/5853dt2x
Ranked #4 in Brex's Fast-Growing Software Vendors of 2025: https://www.brex.com/journal/brex-benchmark-december-2025
Featured among the top startups on: https://leanaileaderboard.com/
About Our Team
The Inference team at OpenAI is dedicated to translating our cutting-edge research into accessible, transformative technology for consumers, enterprises, and developers. By leveraging our advanced AI models, we enable users to achieve unprecedented levels of innovation and productivity. Our primary focus is enhancing model inference efficiency and accelerating research progress through optimized inference capabilities.
About the Role
We are seeking talented engineers to expand and optimize OpenAI's inference infrastructure, specifically targeting emerging GPU platforms. This role spans a wide range of responsibilities, from low-level kernel optimization to high-level distributed execution. You will collaborate closely with our research, infrastructure, and performance teams to ensure seamless operation of our largest models on cutting-edge hardware.
This position offers a unique opportunity to influence and advance OpenAI's multi-platform inference capabilities, with a strong emphasis on optimizing performance for AMD accelerators.
Your Responsibilities Include:
Overseeing the deployment, accuracy, and performance of the OpenAI inference stack on AMD hardware.
Integrating our internal model-serving infrastructure (e.g., vLLM, Triton) into diverse GPU-backed systems.
Debugging and optimizing distributed inference workloads across memory, network, and compute layers.
Validating the correctness, performance, and scalability of model execution on large GPU clusters.
Collaborating with partner teams to design and optimize high-performance GPU kernels for accelerators using HIP, Triton, or other performance-centric frameworks (a minimal Triton example follows below).
Working with partner teams to develop, integrate, and fine-tune collective communication libraries (e.g., RCCL) to parallelize model execution across multiple GPUs.
Ideal Candidates Will:
Possess experience writing or porting GPU kernels using HIP, CUDA, or Triton, with a strong focus on low-level performance.
Be familiar with communication libraries like NCCL/RCCL and their role in high-throughput model serving.
Have experience with distributed inference systems and scaling models across multiple accelerators.
Enjoy tackling end-to-end performance challenges across hardware, system libraries, and orchestration layers.
Be eager to join a dynamic, agile team building innovative infrastructure from the ground up.
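To give a concrete flavor of the kernel work named in the responsibilities list, here is the canonical vector-add example from Triton's public tutorials. Triton compiles to both NVIDIA and AMD backends, which is why it appears alongside HIP in the posting; this is a teaching example, not OpenAI production code.

```python
# Canonical Triton vector-add kernel (from Triton's public tutorials).
# Illustrative only; portable across NVIDIA and AMD GPU backends.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # each program owns one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The production equivalents (attention, GEMM, communication-overlapped kernels) follow the same structure: a blocked index space, masked loads and stores, and tuning of block sizes per accelerator.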
Full-time|$142.2K/yr - $204.6K/yr|On-site|San Francisco, California
About This Role
Join Databricks as a Software Engineer focused on GenAI inference, where you will play a pivotal role in designing, developing, and enhancing the inference engine that drives our Foundation Model API. Working at the intersection of research and production, you will ensure our large language model (LLM) serving systems are optimized for speed, scalability, and efficiency. Your contributions will span the entire GenAI inference stack, from kernels and runtimes to orchestration and memory management.
What You Will Do
Participate in the design and implementation of the inference engine, collaborating on a model-serving stack tailored for large-scale LLM inference.
Work closely with researchers to integrate new model architectures and features such as sparsity, activation compression, and mixture-of-experts into the engine.
Optimize latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
Build and maintain tools for instrumentation, profiling, and tracing to identify bottlenecks and inform optimization efforts.
Develop scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads (a batching sketch follows below).
Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including A/B launches, rollback, and model versioning.
Integrate with federated and distributed inference infrastructure, orchestrating across nodes, balancing load, and managing communication overhead.
Engage in cross-functional collaboration with platform engineers, cloud infrastructure, and security/compliance teams.
Document and share insights, contributing to internal best practices and open-source initiatives as appropriate.
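As a rough illustration of the batching and scheduling work in that list, here is a hypothetical dynamic batcher, not Databricks' engine: it collects requests until a size cap or deadline is hit, then runs one fused forward pass. Real engines use continuous batching and are far more sophisticated; the constants and run_model helper are assumptions for the sketch.

```python
# Hypothetical dynamic-batching sketch. Collect requests until the batch is
# full or a deadline expires, then run one batched forward pass.
import asyncio

MAX_BATCH = 8        # assumed batch-size cap
MAX_WAIT_MS = 10     # assumed latency budget for filling a batch

queue: asyncio.Queue = asyncio.Queue()

async def run_model(prompts):
    """Placeholder for a single batched forward pass over the whole list."""
    return [p.upper() for p in prompts]

async def submit(prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut  # resolved by the batcher once the batch completes

async def batcher():
    loop = asyncio.get_running_loop()
    while True:
        prompt, fut = await queue.get()          # block for the first request
        batch = [(prompt, fut)]
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:            # fill until cap or deadline
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = await run_model([p for p, _ in batch])
        for (_, f), res in zip(batch, results):
            f.set_result(res)

async def main():
    task = asyncio.create_task(batcher())
    print(await asyncio.gather(*(submit(f"req {i}") for i in range(20))))
    task.cancel()

asyncio.run(main())
```

The core trade-off the sketch exposes is the same one the role optimizes at scale: a longer fill window raises GPU utilization and throughput but adds queuing latency to every request in the batch.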