Senior AI Infrastructure Engineer - Training Platform
Scale AI · San Francisco, CA; Seattle, WA; New York, NY
On-site · Full-time
Experience Level
Senior
Qualifications
Proven experience in AI infrastructure engineering or a related field
Strong proficiency in cloud services (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
Solid understanding of machine learning frameworks such as TensorFlow or PyTorch
Experience with CI/CD pipelines and DevOps practices
Excellent problem-solving skills and ability to work in a fast-paced environment
Bachelor's degree in Computer Science, Engineering, or a related field
About the job
Scale AI is seeking a Senior AI Infrastructure Engineer to help build and refine the company’s Training Platform. This position centers on designing, implementing, and improving infrastructure that supports machine learning teams as they train and deploy models.
Role overview
This engineer will work closely with colleagues across different functions to create solutions that make AI systems more efficient. The focus is on enabling faster, more reliable model training and deployment.
Key responsibilities
Design and build infrastructure for AI model training
Implement and optimize systems to support machine learning workflows
Collaborate with teams throughout the company to improve platform capabilities
Locations
This role is based in San Francisco, Seattle, or New York.
About Scale AI
Scale AI is at the forefront of AI innovation, helping companies harness the power of artificial intelligence to accelerate their operations. Our cutting-edge technologies and collaborative environment empower engineers to push boundaries and create impactful solutions that redefine industries.
About Us
At novita-ai, we are a rapidly growing global provider of AI cloud infrastructure, leading the charge in the artificial intelligence revolution. Our innovative platform equips developers and enterprises with powerful, scalable, and user-friendly solutions such as Model APIs, GPU Instances, and Serverless Computing. As organizations around the globe strive to integrate AI into their offerings, we serve as the essential engine that fuels their innovative efforts.
Join our world-class team and contribute to our expanding customer base. This unique opportunity allows you to be part of a dynamic company in a hyper-growth market, where your technical skills will directly impact customer success and drive our business forward.
The Role
As a Solutions Engineer, you will act as the primary technical leader and trusted advisor for our clients throughout their journey. You will collaborate closely with the sales team to bridge the gap between complex customer challenges and our sophisticated technical solutions. Your mission is to build technical credibility, demonstrate the capabilities of our platform, and design tailored solutions that empower our clients to achieve their AI-related business objectives.
What You'll Do
Technical Discovery & Solution Design: Collaborate with Account Executives to gain a deep understanding of customer needs, technical requirements, and business goals. Develop elegant and effective solutions utilizing our AI infrastructure stack (Model APIs, GPU Instances, Serverless).
Product Demonstration & Proof of Concept (POC): Conduct engaging, customized product demonstrations and interactive workshops. Plan, manage, and execute successful POCs, showcasing the value and performance of our platform within the client’s environment.
Technical Evangelism & Trusted Advisory: Communicate the value proposition of our platform to diverse audiences, including both technical and non-technical stakeholders, from engineers to C-level executives. Establish yourself as the go-to expert for customers on best practices in AI infrastructure.
Sales Enablement & Market Feedback Loop: Create and maintain technical sales materials, including whitepapers, best practice guides, and demo scripts. Serve as the voice of the customer, relaying valuable feedback from the field to our Product and Engineering teams to influence our product roadmap.
Onboarding & Implementation Guidance: Facilitate a seamless post-sales transition by providing initial onboarding support and architectural guidance, setting customers up for sustained success.
Join Our Mission
At Hyperbolic Labs, we are dedicated to democratizing artificial intelligence by eliminating barriers to computing power through our Open-Access AI Cloud. We aggregate global computing resources to provide an innovative GPU marketplace and AI inference service, making AI affordable and accessible for everyone. As pioneers at the crossroads of AI and open-source technology, we envision a future where AI innovation is driven by imagination, not resource limitations. We invite forward-thinking individuals who share our vision of making AI universally accessible, secure, and cost-effective to join us in crafting a platform that empowers innovators to realize their groundbreaking AI projects.
As we gear up for expansion following our Series A funding, our team, led by co-founders with PhDs in AI, Mathematics, and Computer Science, is set to transform the landscape of computing.
The Role
We are on the lookout for a Senior Infrastructure Engineer to drive the development and scaling of Hyperbolic's GPU Cloud Marketplace. In this pivotal role, you will create a multi-tenancy provisioning and virtualization solution that transforms raw GPUs from diverse global suppliers into a programmable, orchestrated resource pool serving thousands of AI developers and researchers. You will work at the forefront of cloud infrastructure, building the core orchestration layer that allows our platform to deliver cost savings of up to 75% compared to traditional cloud providers.
Join Our Team as an AI Infrastructure Engineer
At Spellbrush, the premier generative AI studio behind niji・journey, we are in search of a talented AI Infrastructure Engineer to help us develop and enhance our end-to-end machine learning infrastructure, facilitating the operation of our models across a variety of platforms.
Key Responsibilities
Design, implement, and maintain next-generation inference architecture to optimize the performance of our models across mobile, web, and other platforms.
Collaborate with a dynamic team focused on creating cutting-edge image generation models that serve over 16 million users globally.
Ideal Candidate Profile
Experience with Large Distributed Systems: You possess a strong background in modern technologies such as Kubernetes (K8s), Kafka, NATS, and Redis. Your hands-on experience spans both on-premises and multi-cloud environments, and you understand the intricacies and potential pitfalls of each system.
Expertise in GPU Workloads: Your understanding of GPU processing for handling substantial workloads sets you apart. Experience deploying or optimizing GPU workloads end-to-end is a significant advantage.
Passion for Anime Aesthetics: As avid anime enthusiasts, we value team members who share our passion for the anime aesthetic, contributing to a creative movement that engages millions.
Team Player in Fast-Paced Environments: You thrive in small, agile teams and are eager to work alongside some of the world's top AI researchers, contributing to the best image models globally. We believe in the power of in-person collaboration, with opportunities at our offices in Tokyo (downtown Akihabara) or San Francisco. Visa sponsorship is available.
Join the Revolution at Retell AI
Retell AI is pioneering the future of call centers through innovative voice AI, driven by first-principles thinking.
In just 18 months since our inception, we have empowered thousands of businesses with our AI voice agents, transforming how sales, support, and logistics calls are managed, work that previously required extensive human teams. Supported by prestigious investors such as Y Combinator and Alt Capital, we've rapidly scaled from $5M ARR to $36M ARR with a compact yet dynamic team of 20.
Our ambition for 2026 is to create a revolutionary customer experience platform, where entire contact centers are powered by AI. Moving beyond basic automation, we aim to develop intelligent AI "workers" that serve as frontline agents, QA analysts, and managers, continuously enhancing customer interactions without the need for constant human oversight.
As we expand, we are seeking passionate engineers who are eager to solve challenging technical problems, act swiftly, and make a significant impact in one of the fastest-growing voice AI startups. Let's shape the future together.
Full-time|On-site|San Francisco, California, United States
At Yutori, we are transforming the way individuals engage with the digital realm by developing AI agents capable of efficiently performing everyday online tasks. Our approach is to create a comprehensive, agent-first ecosystem, encompassing everything from training proprietary models to designing innovative generative product interfaces.
To further this mission, we are seeking a skilled AI Engineer to join our pioneering team. Ideal candidates should possess strong technical expertise and a passion for crafting superhuman AI agents that can navigate the web autonomously.
Our founders — Devi Parikh, Abhishek Das, and Dhruv Batra — bring a wealth of experience in AI research and product development, particularly in generative, multimodal, and embodied AI, honed during their time at Meta. Our team merges AI proficiency with a design-oriented approach to advance Yutori’s objectives.
Yutori is proudly supported by a distinguished group of visionary investors, including Elad Gil, Sarah Guo, Jeff Dean, Fei-Fei Li, Amjad Masad, Guillermo Rauch, Akshay Kothari, Soleio, Oliver Cameron, Julien Chaumond, Logan Kilpatrick, Bryan McCann, Vladlen Koltun, Jamie Cuffe, Michele Catasta, and many others.
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY
Join our dynamic Machine Learning Infrastructure team as a Senior AI Infrastructure Engineer, where you will play a pivotal role in designing and constructing platforms that ensure the scalable, reliable, and efficient serving of Large Language Models (LLMs). Our innovative platform supports a range of cutting-edge research and production systems, catering to both internal and external applications across diverse environments.
The ideal candidate will possess a solid foundation in machine learning principles coupled with extensive experience in backend system architecture. You will thrive in a collaborative environment that bridges research and engineering, working diligently to provide seamless experiences for our customers and accelerating innovation across the organization.
Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC
Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.
Your Responsibilities Include
Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.
Owning services or systems while defining long-term health objectives and enhancing the health of related components.
Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.
Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.
Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.
Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.
Full-time|Remote|North America Remote / San Francisco, CA
Join Our Team as a Software Engineer - AI Infrastructure
Location: North America Remote / San Francisco · Full-Time
At Andromeda Cluster, we are dedicated to democratizing access to advanced AI infrastructure that was once only available to hyperscalers. Founded by industry leaders Nat Friedman and Daniel Gross, we have evolved from a singular managed cluster to a global platform that connects top AI labs, data centers, and cloud providers around the world. Our orchestration layer efficiently manages training and inference tasks globally, enhancing flexibility and efficiency in this rapidly expanding sector. We aim to create a global marketplace for AI computing, empowering AGI with the same fluidity as global financial markets.
As we continue to grow, we are on the lookout for talented individuals in the fields of AI infrastructure, research, and engineering.
Your Role
In the position of Infrastructure Product Engineer, you will be integral in constructing the foundational framework of Andromeda's platform. Your challenge will be to simplify complex, real-world infrastructure issues into scalable product solutions that our customers will benefit from.
Key Responsibilities
Architect and develop essential platform components, focusing on infrastructure orchestration, provisioning, and lifecycle management solutions.
Create robust APIs, services, and control planes that abstract diverse infrastructure types, including VMs, Kubernetes, bare metal, and schedulers.
Convert customer usage patterns into actionable product requirements, delivering impactful features and enhancements.
Design automation and internal tools to mitigate manual and ad-hoc operational tasks.
Improve platform reliability, performance, and observability, focusing on sustainable enhancements rather than quick fixes.
Collaborate with other teams to establish clear ownership boundaries between platform features and customer-specific solutions.
Write clean, maintainable, and well-documented code with a focus on long-term sustainability.
Engage in technical design discussions and contribute to the architectural advancements of our platform.
Genesis Molecular AI is building the GEMS molecular AI platform, driving advances in foundation model training and industrial screening. Strategic partnerships and a strong compute infrastructure are central to the company's growth and mission.
Role Overview
The Director of AI Infrastructure Partnerships will lead efforts to secure and manage critical technology alliances, investments, and compute resources. This leader will work closely with top AI organizations, hardware providers, and investors, including firms like a16z and NVIDIA, to support Genesis's technical and business goals. The role is based in either New York City or the San Francisco Bay Area.
What You Will Do
Oversee partnerships with NVIDIA and identify new opportunities with leading AI organizations.
Structure contracts, equity deals, technical collaborations, co-publications, and data-sharing agreements for both public and proprietary experimental and synthetic data.
Create presentations and written materials that clearly communicate Genesis's platform vision and technical strengths to partners and investors, and integrate these messages into broader external communications.
Serve as the business lead and chief negotiator for major cloud computing and AI infrastructure deals. Secure high-performance compute at competitive rates and maintain strong relationships with key partners.
Monitor the AI compute market, evaluating providers for cost, reliability, and availability to support research and deployment needs.
Work with ML Engineering to forecast compute requirements for model training, synthetic data generation, fine-tuning, and large-scale inference.
Optimize performance and budget across multiple cloud environments and track usage to maximize value.
Manage the internal budgeting process for compute spend. Translate technical needs into financial forecasts and present capital allocation recommendations to company leadership.
What We're Looking For
Significant experience in AI and cloud computing, including managing high-value negotiations and partnerships.
Strong analytical and strategic skills, with the ability to assess market trends and make informed decisions.
Excellent communication and interpersonal abilities, comfortable explaining complex topics to a range of audiences.
Full-time|$200K/yr - $240K/yr|On-site|San Francisco, CA
Contribute to a Safer Future
TRM Labs is at the forefront of blockchain analytics and AI technology, empowering law enforcement, financial institutions, and cryptocurrency enterprises to identify and combat cryptocurrency-related fraud and financial crime. Our innovative blockchain intelligence and AI tools are designed to trace fund flows, pinpoint illicit activities, build comprehensive cases, and provide actionable insights into potential threats. Trusted by prominent agencies and organizations globally, TRM is committed to fostering a safer and more secure environment for everyone.
Join our dynamic AI Engineering Team, dedicated to pioneering next-generation AI applications, with a particular emphasis on Large Language Models (LLMs) and agent-based systems. Our objective is to create efficient pipelines, high-caliber infrastructure, and operational tools that facilitate the rapid, safe, and scalable deployment of AI systems.
We oversee petabyte-scale data pipelines, deliver models with millisecond latency, and ensure the observability and governance necessary to make AI production-ready. Our team actively evaluates and integrates cutting-edge technologies in the LLM and agent domains, utilizing open-source stacks, vector databases, evaluation frameworks, and orchestration tools that enhance TRM's agility and innovation capacity.
As a Senior or Staff AI Infrastructure Engineer, you will play a pivotal role in constructing and scaling the technical framework for AI and ML systems. Your responsibilities will include:
Developing reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools like Langfuse, GitHub Actions, and experiment tracking systems.
Automating model versioning, approval workflows, and compliance checks across various environments.
Building a modular and scalable AI infrastructure stack, encompassing vector databases, feature stores, model registries, and observability tools.
Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
Continuously assessing and integrating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
Driving AI reliability and governance, facilitating experimentation while ensuring compliance, security, and uptime.
Enhancing the performance of AI and ML models.
Ensuring data accuracy, consistency, and reliability for improved model training and inference.
Deploying infrastructure to support both offline and online evaluations of LLMs and agents.
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY
Join Scale AI's innovative team as an Infrastructure Software Engineer for our Enterprise Generative AI Platform (SGP). In this dynamic role, you will help design and enhance our enterprise-grade AI platform, which offers robust APIs for knowledge retrieval, inference, evaluation, and more. We're seeking an exceptional engineer who thrives in fast-paced environments and is eager to contribute to the scaling of our core infrastructure. The ideal candidate will possess a solid foundation in software engineering principles and extensive experience with large-scale distributed systems. Your role will involve implementing solutions across various cloud providers (GCP, Azure, AWS) for clients in highly regulated sectors, including healthcare, telecommunications, finance, and retail.
About Eventual
At Eventual, we are reimagining how AI applications process vast amounts of data, from images to complex datasets. Traditional data platforms are not equipped to handle the petabytes of multimodal data essential for AI, causing teams to struggle with inadequate infrastructure. Founded in 2022, our mission is to simplify data querying, making it as intuitive as working with tables while ensuring scalability for production workloads.
Our open-source engine, Daft, is specifically designed for real-world AI systems. It efficiently manages external APIs and GPU clusters, and addresses failures that traditional engines cannot handle. Daft is already integral to operations at leading companies such as Amazon, Mobileye, Together AI, and CloudKitchens.
We pride ourselves on our exceptional team, which includes talent from Databricks, AWS, Nvidia, Pinecone, GitHub Copilot, Tesla, and others. We have quadrupled our team size in just a year, supported by Series A and seed funding from notable investors like Felicis, CRV, Microsoft M12, and Y Combinator. We are now eager to expand further. Join us—Eventual is just getting started.
We are seeking passionate individuals who are excited to collaborate in a close-knit team environment, working together four days a week in our San Francisco Mission District office.
Your Role
As a Software Engineer, you will take charge of developing Eventual's core products and architecture. You'll deliver features that our customers will use immediately and collaborate with a dedicated team that values open communication and cross-functional teamwork. Our fast-paced environment is focused on solving a variety of complex technical and product challenges. While our experienced team is here to provide guidance and mentorship, we appreciate engineers who can independently identify and tackle challenging technical issues.
Key Responsibilities
Design and develop highly reliable and resilient products and features.
Collaborate closely with cross-functional product and customer-facing teams to understand requirements and deliver thoughtful solutions.
Write high-quality, extensible, and maintainable code.
Create and build scalable applications and components.
Architect and manage Kubernetes clusters optimized for our needs.
Full-time|Remote|Global Remote / San Francisco, CA
Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster, established by Nat Friedman and Daniel Gross, aims to democratize access to advanced AI infrastructure for early-stage startups, previously exclusive to hyperscalers. Our journey began with a single managed cluster that quickly reached capacity, propelling us to develop robust systems, networking, and orchestration layers to make AI infrastructure more accessible than ever.
Today, we collaborate with top AI laboratories, data centers, and cloud service providers to deliver compute resources precisely when and where they're needed most. Our platform efficiently manages the routing of training and inference jobs across a global supply chain, facilitating flexibility and efficiency in one of the most rapidly expanding markets worldwide.
Our vision is to create a liquidity layer for global AI compute — a marketplace that dynamically moves the infrastructure and workloads essential for AGI, akin to the capital flows in global financial markets.
We are on the lookout for talented individuals who excel in AI infrastructure, research, and engineering to join our pioneering team.
Your Responsibilities
Provision, configure, and manage Kubernetes clusters for clients across various service providers.
Develop automation tools to enhance the deployment and integration of clusters.
Troubleshoot customer issues related to networking, storage, scheduling, and system layers.
Enhance the reliability and scalability of training and inference infrastructures.
Design and implement monitoring, alerting, and observability solutions for critical systems.
Work collaboratively with engineering and product teams to strategize and deliver infrastructure for new services.
Engage in on-call duties and incident response, leading postmortems and reliability enhancements.
Ideal Candidate Profile
A minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles.
Solid foundation in Linux systems and networking principles.
Extensive expertise in Kubernetes and container orchestration at scale.
Proficient in Infrastructure-as-Code methodologies (Terraform, Helm, etc.).
Decagon is seeking an Engineering Manager to lead its AI & Data Infrastructure team in San Francisco. This role centers on guiding engineers as they develop AI solutions and robust data frameworks to advance Decagon's technology roadmap.
Role overview
The Engineering Manager will oversee a team dedicated to AI and data infrastructure initiatives. The position involves hands-on leadership, ensuring projects move forward and align with company objectives.
What you will do
Lead and mentor engineers working on AI and data infrastructure projects
Drive project execution to enhance product capabilities
Foster a collaborative and supportive team environment
Oversee strategic planning and allocate resources for the team
Manage team performance and encourage professional growth
Requirements
Experience leading technical teams in AI and data infrastructure
Strong leadership and clear communication abilities
Skill in strategic planning and resource management
Dedication to building technology solutions that make a difference
This position offers the chance to shape Decagon's products and technology direction through AI and data-driven work.
Reflection AI builds open-weight models for a wide range of users, including individuals, businesses, and governments. The team brings together talent from organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic, all working to advance open superintelligence.
Role overview
The AI Compute and Infrastructure Counsel acts as the main legal advisor to Reflection AI's Strategy and Operations teams on complex infrastructure initiatives. Based in San Francisco, this attorney leads negotiations and manages agreements that support the company's growing AI infrastructure. The work spans collaborations with hardware manufacturers, cloud capacity deals, and contracts related to data centers, utilities, and new facility builds.
This position is designed for a commercial lawyer with experience at the intersection of advanced AI and infrastructure. The role provides autonomy, the opportunity to establish legal frameworks for a new function, and a direct impact on the company's AI systems.
What you will do
Negotiate compute and cloud capacity agreements with hyperscalers, neoclouds, and new vendors, covering terms like capacity reservations, service-level commitments, portability, and exit rights.
Manage hardware partnerships with vendors in chips, accelerators, servers, and networking.
Oversee legal support for data center and AI facility projects, including master agreements for colocation and hosting, ground leases, build-to-suit leases, construction contracts, interconnection agreements, and power purchase agreements.
Structure and negotiate power arrangements, such as power purchase agreements, tolling agreements, utility service contracts, behind-the-meter generation, and long-term energy deals.
Lead legal work on strategic infrastructure transactions, including joint ventures, site acquisitions, and custom financing models for the AI factory roadmap.
Develop scalable playbooks, templates, and delegation systems to help commercial and infrastructure teams operate efficiently and maintain high standards.
Collaborate with Security, Privacy, and Policy teams on matters like tenant isolation, customer data handling, and sovereign compute requirements.
Plasmidsaurus helps scientists worldwide by streamlining sequencing. Researchers from leading institutions and companies rely on this platform daily. With a global network of labs, the company delivers fast, affordable sequencing results, and has recently expanded into RNA-seq to broaden its genomics reach. The team is focused on building a universal sequencing platform designed for efficiency and global scale.
Role overview
The Lead Engineer for AI Infrastructure in Platform Engineering sets both technical direction and management strategy for the company's compute, data, AI, and security infrastructure. This position oversees the entire sequencing operation, from laboratory devices to data delivery.
What you will do
Oversee core services that coordinate laboratory devices, including robots, sequencers, and on-premises Linux servers, as well as the data ingestion pipeline.
Develop cloud infrastructure and data pipelines for storing, processing, and delivering terabytes of sequencing data.
Design systems to manage millions of bioinformatics tasks, handling queue management, workflow orchestration, and scheduling.
Build AI infrastructure and internal tools to support autonomous systems, including:
Quality Scientist Agents: Monitor operations, detect anomalies, and escalate quality or reliability concerns.
Logistics Agents: Coordinate global transportation of samples to labs and carriers.
Bioinformatics Coding Agents: Run adaptive analyses on varied sample types with different data distributions.
Culture
The team values initiative and a strong sense of ownership. High agency and responsibility shape how work gets done at Plasmidsaurus.
At Stuut, we are revolutionizing the accounts receivable landscape for B2B companies by streamlining collections processes to be smarter and more efficient. Our innovative platform has garnered attention from finance teams across various industries, including industrials, chemicals, and manufacturing, ranging from Fortune 10 giants to emerging mid-market players. We are proud to be supported by premier investors such as a16z, Khosla, Activant, 1984 Ventures, and Page One.
About the Role
Voice telecommunications play a vital role in accounts receivable operations. Our customers initiate contact via phone, agents engage in calls, and effective transfers are crucial. We are developing AI-driven agents that actively contribute to collections, cash application, disputes, and customer communications, with voice being a key component of these interactions.
We are seeking a Lead Voice Infrastructure Engineer to lay the groundwork for our AI-enhanced voice initiatives at Stuut. In this position, you will collaborate closely with our engineering, product, and go-to-market teams to design and maintain the telephony infrastructure that supports high-volume phone interactions. This role merges telephony infrastructure, real-time voice systems, AI voice agents, and finance workflows.
The ideal candidate will possess deep expertise in telephony, be adept at constructing distributed systems that perform reliably under production loads, and recognize the significant impact of infrastructure decisions on product outcomes. You will oversee the systems that manage call connectivity, transfer effectiveness, and business outcome measurements. As the first dedicated voice infrastructure hire at Stuut, you will be instrumental in shaping our approach.
About Us
At Sierra, we are revolutionizing the way businesses engage with their customers by building a cutting-edge platform that harnesses the power of AI. Our headquarters is located in the vibrant city of San Francisco, with additional offices expanding in Atlanta, New York, London, France, Singapore, and Japan.
Our company culture is deeply rooted in our core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and foster an environment where innovation thrives.
Sierra was co-founded by visionary leaders Bret Taylor, who currently serves as the Board Chair of OpenAI and has a rich history with Salesforce and Facebook, and Clay Bavor, who previously led Google Labs and spearheaded initiatives like Google Lens and Project Starline.
Your Role
As a Software Engineer focusing on Infrastructure at Sierra, you will play a pivotal role in designing, constructing, and maintaining the foundational systems that empower our AI platform. Your expertise will ensure that our infrastructure is not only secure and reliable but also scalable, allowing product teams to execute their work with agility and confidence.
Guarantee the reliability, scalability, and performance of our platform and LLM inference serving in response to increasing traffic demands.
Develop and oversee cloud infrastructure using Terraform to create secure, scalable, and reproducible environments.
Establish and manage a self-service infrastructure platform to empower engineering teams in deploying and operating services independently.
Take ownership of and improve CI/CD pipelines and release management processes, facilitating rapid and reliable deployments across Sierra’s platform.
Design and manage distributed systems utilizing distributed databases, retrieval systems, and machine learning models.
Develop and sustain core data serving abstractions along with essential authentication and security features (SSO, RBAC, authentication controls).
Effectively navigate and integrate our technology stack with enterprise customer environments in a scalable and maintainable manner.
At Exa, we are on a mission to create a cutting-edge search engine from the ground up, designed to cater to the diverse needs of AI applications. Our team is building a robust infrastructure that enables us to crawl the internet, train advanced embedding models for indexing, and develop high-performance vector databases using Rust. Additionally, we manage a significant $5M H200 GPU cluster that powers tens of thousands of machines.
The Infrastructure Team at Exa is responsible for developing the essential tools and infrastructure that support our entire system. We are looking for talented infrastructure engineers to help us scale our capabilities rapidly. Your work could involve orchestrating GPU clusters with Kubernetes, implementing map-reduce batch jobs on Ray, or creating top-tier observability tools that set industry standards.
Sep 3, 2025