Senior HPC & GPU Infrastructure Engineer

SciforiumSan Francisco

On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Senior

Qualifications

The ideal candidate will have a strong background in high-performance computing and GPU technologies, coupled with hands-on experience in Linux system administration. Proficiency in managing machine learning frameworks and tools such as CUDA, PyTorch, and JAX is essential. You should also have excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment. A Bachelor's degree in Computer Science, Engineering, or a related field is preferred.

About the job

About the Role

We are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.

Your Responsibilities

1. System Health & Reliability (SRE)

On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.
Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.
Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

OS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.
Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

Deployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.

About Sciforium

Sciforium is a cutting-edge AI infrastructure company committed to developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. With significant investment backing and direct support from AMD, we are rapidly growing our team to build the comprehensive stack that powers advanced AI models and real-time applications.

Similar jobs

1 - 20 of 8,545 Jobs

Search for Senior Ai Infrastructure Engineer

8,545 results

Select all on this page (20)

Apply

Senior AI Infrastructure Engineer

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Join Our MissionAt Hyperbolic Labs, we are dedicated to democratizing artificial intelligence by eliminating barriers to computing power through our Open-Access AI Cloud. We aggregate global computing resources to provide an innovative GPU marketplace and AI inference service, making AI affordable and accessible for everyone. As pioneers at the crossroads of AI and open-source technology, we envision a future where AI innovation is driven by imagination, not resource limitations. We invite forward-thinking individuals who share our vision of making AI universally accessible, secure, and cost-effective to join us in crafting a platform that empowers innovators to realize their groundbreaking AI projects.As we gear up for expansion following our Series A funding, our team, led by co-founders with PhDs in AI, Mathematics, and Computer Science, is set to transform the landscape of computing.The RoleWe are on the lookout for a Senior Infrastructure Engineer to drive the development and scaling of Hyperbolic's GPU Cloud Marketplace. In this pivotal role, you will create a multi-tenancy provisioning and virtualization solution that transforms raw GPUs from diverse global suppliers into a programmable, orchestrated resource pool serving thousands of AI developers and researchers. You will work at the forefront of cloud infrastructure, building the core orchestration layer that allows our platform to deliver cost savings of up to 75% compared to traditional cloud providers.

Mar 26, 2026

Apply

Senior Software Engineer, Infrastructure at Retell AI | San Francisco

Retell AI

Full-time|On-site|San Francisco Bay Area

Join the Revolution at Retell AIRetell AI is pioneering the future of call centers through innovative voice AI, driven by first principles thinking.In just 18 months since our inception, we have empowered thousands of businesses with our AI voice agents, transforming how sales, support, and logistics calls are managed—previously requiring extensive human teams. Supported by prestigious investors such as Y Combinator and Alt Capital, we've rapidly scaled from $5M ARR to an impressive $36M ARR with a compact yet dynamic team of 20.Our ambition for 2026 is to create a revolutionary customer experience platform, where entire contact centers are powered by AI. Moving beyond basic automation, we aim to develop intelligent AI “workers” that serve as frontline agents, QA analysts, and managers, continuously enhancing customer interactions without the need for constant human oversight.As we expand, we are seeking passionate engineers who are eager to solve challenging technical problems, act swiftly, and make a significant impact in one of the fastest-growing voice AI startups. Let’s shape the future together.

Aug 12, 2025

Apply

Senior or Staff AI Infrastructure Engineer

TRM Labs

Full-time|$200K/yr - $240K/yr|On-site|San Francisco, CA

Contribute to a Safer Future.TRM Labs is at the forefront of blockchain analytics and AI technology, empowering law enforcement, financial institutions, and cryptocurrency enterprises to identify and combat cryptocurrency-related fraud and financial crime. Our innovative blockchain intelligence and AI tools are designed to trace fund flows, pinpoint illicit activities, build comprehensive cases, and provide actionable insights into potential threats. Trusted by prominent agencies and organizations globally, TRM is committed to fostering a safer and more secure environment for everyone.Join our dynamic AI Engineering Team, dedicated to pioneering next-generation AI applications, with a particular emphasis on Large Language Models (LLMs) and agent-based systems. Our objective is to create efficient pipelines, high-caliber infrastructure, and operational tools that facilitate the rapid, safe, and scalable deployment of AI systems.We oversee petabyte-scale data pipelines, deliver models with millisecond latency, and ensure the observability and governance necessary to make AI production-ready. Our team actively evaluates and integrates cutting-edge technologies in the LLM and agent domains, utilizing open-source stacks, vector databases, evaluation frameworks, and orchestration tools that enhance TRM’s agility and innovation capacity.As a Senior or Staff AI Infrastructure Engineer, you will play a pivotal role in constructing and scaling the technical framework for AI and ML systems. Your responsibilities will include:Developing reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools like Langfuse, GitHub Actions, and experiment tracking systems.Automating model versioning, approval workflows, and compliance checks across various environments.Building a modular and scalable AI infrastructure stack, encompassing vector databases, feature stores, model registries, and observability tools.Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.Continuously assessing and integrating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).Driving AI reliability and governance, facilitating experimentation while ensuring compliance, security, and uptime.Enhancing the performance of AI and ML models.Ensuring data accuracy, consistency, and reliability for improved model training and inference.Deploying infrastructure to support both offline and online evaluations of LLMs and agents.

Mar 12, 2026

Apply

Solutions Engineer - AI Cloud Infrastructure

novita-ai

Full-time|On-site|San Francisco

About Us:At novita-ai, we are a rapidly growing global provider of AI cloud infrastructure, leading the charge in the artificial intelligence revolution. Our innovative platform equips developers and enterprises with powerful, scalable, and user-friendly solutions such as Model APIs, GPU Instances, and Serverless Computing. As organizations around the globe strive to integrate AI into their offerings, we serve as the essential engine that fuels their innovative efforts.Join our world-class team and contribute to our expanding customer base. This unique opportunity allows you to be part of a dynamic company in a hyper-growth market, where your technical skills will directly impact customer success and drive our business forward.The Role:As a Solutions Engineer, you will act as the primary technical leader and trusted advisor for our clients throughout their journey. You will collaborate closely with the sales team to bridge the gap between complex customer challenges and our sophisticated technical solutions. Your mission is to build technical credibility, demonstrate the capabilities of our platform, and design tailored solutions that empower our clients to achieve their AI-related business objectives.What You'll Do:Technical Discovery & Solution Design: Collaborate with Account Executives to gain a deep understanding of customer needs, technical requirements, and business goals. Develop elegant and effective solutions utilizing our AI infrastructure stack (Model APIs, GPU Instances, Serverless).Product Demonstration & Proof of Concept (POC): Conduct engaging, customized product demonstrations and interactive workshops. Plan, manage, and execute successful POCs, showcasing the value and performance of our platform within the client’s environment.Technical Evangelism & Trusted Advisory: Communicate the value proposition of our platform to diverse audiences, including both technical and non-technical stakeholders, from engineers to C-level executives. Establish yourself as the go-to expert for customers on best practices in AI infrastructure.Sales Enablement & Market Feedback Loop: Create and maintain technical sales materials, including whitepapers, best practice guides, and demo scripts. Serve as the voice of the customer, relaying valuable feedback from the field to our Product and Engineering teams to influence our product roadmap.Onboarding & Implementation Guidance: Facilitate a seamless post-sales transition by providing initial onboarding support and architectural guidance, setting customers up for sustained success.

Aug 27, 2025

Apply

Senior AI Infrastructure Engineer - Training Platform

Scale AI

Full-time|On-site|San Francisco, CA; Seattle, WA; New York, NY

Scale AI is seeking a Senior AI Infrastructure Engineer to help build and refine the company’s Training Platform. This position centers on designing, implementing, and improving infrastructure that supports machine learning teams as they train and deploy models. Role overview This engineer will work closely with colleagues across different functions to create solutions that make AI systems more efficient. The focus is on enabling faster, more reliable model training and deployment. Key responsibilities Design and build infrastructure for AI model training Implement and optimize systems to support machine learning workflows Collaborate with teams throughout the company to improve platform capabilities Locations This role is based in San Francisco, Seattle, or New York.

Apr 29, 2026

Apply

Senior AI Infrastructure Engineer - Model Serving Platform

Scale AI

Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY

Join our dynamic Machine Learning Infrastructure team as a Senior AI Infrastructure Engineer, where you will play a pivotal role in designing and constructing platforms that ensure the scalable, reliable, and efficient serving of Large Language Models (LLMs). Our innovative platform supports a range of cutting-edge research and production systems, catering to both internal and external applications across diverse environments.The ideal candidate will possess a solid foundation in machine learning principles coupled with extensive experience in backend system architecture. You will thrive in a collaborative environment that bridges research and engineering, working diligently to provide seamless experiences for our customers and accelerating innovation across the organization.

Mar 26, 2026

Apply

Senior Infrastructure Engineer

LlamaIndex

Full-time|On-site|San Francisco

Be part of our mission to redefine AI by shaping the narrative surrounding document understanding.Role OverviewAt LlamaIndex, our Infrastructure team lays the groundwork for our product and provides essential tools that facilitate the development, deployment, and monitoring of our code. We are tasked with designing, constructing, and scaling the core infrastructure that drives a high-capacity data platform for AI applications. We seek individuals who are passionate about creating supportive systems that enhance our engineering capabilities and contribute to our rapidly expanding product suite.Ideal candidates will have a strong background in cloud infrastructure management, navigating various scalability challenges, and enhancing the productivity of the broader Engineering team. Key traits we value in our culture include a customer-centric mindset, collaboration, diligence, and optimism. We are looking for proactive team players who are eager to help us evolve our culture as we grow.Key ResponsibilitiesCollaborate with engineering teams to develop and maintain foundational systems that empower developers and support our rapid growth.Design and execute scalable infrastructure solutions suitable for various deployment models, including SaaS, single-tenant, and private environments.Oversee and optimize cloud resources and Kubernetes clusters to ensure cost-effectiveness and high performance.Facilitate successful external customer deployments by establishing clear infrastructure guidelines and principles.Enhance the release and deployment processes to improve efficiency and reliability.Ensure compliance with applicable regulations and implement comprehensive security measures across all deployment environments.QualificationsMinimum of 5 years of engineering experience.Experience working on Platform or Infrastructure teams on substantial projects involving infrastructure components like Terraform/CDKTF, Kubernetes, Helm, testing infrastructure, release management, and observability.Proficient in optimizing cloud resource utilization.Skilled in tuning Kubernetes clusters and cloud resources for optimal performance and cost efficiency.Dedicated to cultivating LlamaIndex’s engineering culture as we expand.Ability to balance speed and pragmatism in delivering solutions.

Feb 24, 2026

Apply

Senior Staff Infrastructure Engineer

TwelveLabs

Full-time|On-site|San Francisco

Who We Are:TwelveLabs is at the forefront of developing innovative multimodal foundation models that enable video comprehension akin to human understanding. Our groundbreaking models have set new benchmarks in video-language modeling, enhancing our capabilities and revolutionizing how we engage with and analyze diverse media formats.With an impressive $107 million in Seed and Series A funding, we're supported by premier venture capital firms including NVIDIA’s NVentures, NEA, Radical Ventures, and Index Ventures, alongside influential AI pioneers like Fei-Fei Li, Silvio Savarese, and Alexandr Wang. Our headquarters in San Francisco, complemented by a significant presence in Seoul, highlights our dedication to fostering global innovation.We celebrate the individuality of every team member’s journey, believing that the diverse cultural, educational, and life experiences of our employees fuel our ability to challenge the status quo. We seek passionate individuals who resonate with our mission and are eager to make a significant impact as we advance technology to reshape the world. Join us in redefining video understanding and multimodal AI.About the RoleAs a Senior Staff Infrastructure Engineer at TwelveLabs, you will leverage your technical expertise and leadership skills to construct the systems that drive our multimodal foundation models. Your focus will be on designing and enhancing a scalable, secure, and high-performance infrastructure that accommodates extensive AI workloads across both cloud-based and on-premises environments.This position demands strong technical acumen, an eagerness to delve into low-level systems when necessary, and the capability to influence infrastructure strategy through hands-on contributions and operational improvements. Your impact will be felt through your technical expertise and the results you deliver, rather than through hierarchical status, in a dynamic and fast-paced environment.In this role, you will:Architect and advance cloud and hybrid infrastructure, blending hands-on execution with technical leadership.Guide the development of AI/ML infrastructure components, engaging directly in critical tasks when necessary.Define infrastructure standards and abstractions while maintaining close interaction with production systems.Collaborate closely with Machine Learning Engineers, Data Scientists, Backend Developers, and other key stakeholders to ensure system alignment and efficiency.

Oct 10, 2025

Apply

Senior Software Engineer, Infrastructure

Serval

Full-time|On-site|San Francisco

Who We AreServal is an innovative AI-driven automation platform redefining operational efficiency for enterprises. Our intelligent agents seamlessly comprehend and execute real-world workflows, replacing outdated manual processes with adaptive, self-learning software. Since our inception in early 2024, we have garnered the trust of industry leaders such as General Motors, Notion, Perplexity, Vercel, Mercor, LangChain, and Verkada, streamlining high-volume operational tasks across their organizations.At the heart of Serval is a cutting-edge agentic AI platform that transforms natural language into actionable workflows. Our agents not only respond to queries but also reason, act across various systems, and continuously enhance their performance. What started as a solution for operational tasks has rapidly expanded into a versatile AI automation layer utilized across IT, HR, Finance, Security, Legal, and Engineering sectors.Our mission is to eradicate repetitive, manual tasks within enterprises, empowering teams through intelligent automation. In the long run, we aim to establish a universal AI operations layer—a system of agents that integrates across business functions, maintaining the momentum of modern companies.We are proud to be backed by renowned investors including Sequoia Capital, Redpoint Ventures, Meritech, First Round, General Catalyst, and Elad Gil, and founded by seasoned product and engineering leaders from Verkada.Role OverviewAs a Senior Software Engineer in Infrastructure at Serval, you will be pivotal in developing and scaling the core systems that empower our AI agents and workflow automation platform. A crucial aspect of this role involves enabling and supporting self-hosted deployments for enterprise clients needing on-premises or private cloud environments. We are looking for engineers with profound expertise in distributed systems, infrastructure-as-code, production operations, and customer-facing support, who aspire to influence the technical architecture of a rapidly evolving platform.What You'll DoDesign, implement, and operate large-scale distributed systems that power Serval's AI agents, workflow orchestration, and data pipelines.Create and maintain Terraform modules to provision and manage cloud infrastructure across AWS, GCP, or Azure environments.Develop and sustain deployment packages, installation scripts, and infrastructure templates, enabling customers to self-host Serval in their own environments.Provide technical support and guidance to enterprise customers during installation and deployment phases.

Jan 29, 2026

Apply

Senior / Staff Infrastructure Engineer

Apiphany

Full-time|$160K/yr - $300K/yr|On-site|San Francisco

About ApiphanyApiphany is a trailblazing AI company focused on revolutionizing physical product development. We empower innovators across automotive, aerospace, medtech, and energy sectors to convert vast unstructured technical data into real-time, actionable insights. Supported by elite investors including Markforged, Databricks, GM, and Character, our mission is to transform engineering decision-making, turning complexity into simplicity for leading manufacturers worldwide.Our advanced models are designed to address the intricacies of engineering and manufacturing, comprehending physics principles, design specifications, and program constraints. Our small, elite team consists of builders hailing from prestigious institutions such as Stanford, Berkeley, MIT, UW, and CMU, along with industry veterans from GM, Ford, and Genesis Therapeutics. We are committed to advancing hard-tech and establishing a market-leading company together.About the RoleIn the role of Senior / Staff Infrastructure Engineer at Apiphany, you will architect, build, and manage the infrastructure that underpins our intelligence platform. Your responsibilities will encompass secure, reliable, and scalable cloud deployments, including the unique challenge of deploying across both internal and customer-managed cloud environments.You will ensure our systems adhere to stringent requirements for latency, availability, and compliance within data-intensive environments. Additionally, you will shape our security strategy, implement infrastructure-as-code practices, and establish a solid foundation enabling engineering teams to deliver with assurance.

Oct 23, 2025

Apply

Senior HPC & GPU Infrastructure Engineer

Sciforium

Full-time|On-site|San Francisco

At Sciforium, we are at the forefront of AI infrastructure, pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.About the RoleWe are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.Your Responsibilities1. System Health & Reliability (SRE)On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.2. Linux & Network AdministrationOS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.3. GPU & ML Stack EngineeringDeployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.

Jan 7, 2026

Apply

AI Infrastructure Engineer

Spellbrush

Full-time|On-site|San Francisco

Join Our Team as an AI Infrastructure EngineerAt Spellbrush, the premier generative AI studio behind niji・journey, we are in search of a talented AI Infrastructure Engineer to help us develop and enhance our end-to-end machine learning infrastructure, facilitating the operation of our models across a variety of platforms.Key Responsibilities:Design, implement, and maintain next-generation inference architecture to optimize the performance of our models across mobile, web, and other platforms.Collaborate with a dynamic team focused on creating cutting-edge image generation models that serve over 16 million users globally.Ideal Candidate Profile:Experience with Large Distributed Systems: You possess a strong background in working with modern technologies such as Kubernetes (K8S), Kafka, NATS, Redis, among others. Your hands-on experience spans both on-premises and multi-cloud environments, and you understand the intricacies and potential pitfalls of each system.Expertise in GPU Workloads: Your understanding of GPU processing for handling substantial workloads sets you apart. Having experience in deploying or optimizing GPU workloads end-to-end is a significant advantage.Passion for Anime Aesthetics: As avid anime enthusiasts, we value team members who share our passion for the anime aesthetic, contributing to a creative movement that engages millions.Team Player in Fast-Paced Environments: You thrive in small, agile teams and are eager to work alongside some of the world's top AI researchers, contributing to the best image models globally. We believe in the power of in-person collaboration, with opportunities at our offices in Tokyo (downtown Akihabara) or San Francisco. Visa sponsorships are available.

Feb 7, 2024

Apply

Senior Applied AI Engineer – Machine Learning for Systems & Infrastructure

Databricks

Full-time|$166K/yr - $210.3K/yr|On-site|San Francisco, California

P-1380 Join Databricks as a Senior Applied AI Engineer, where you will harness the power of machine learning, scheduling, and optimization algorithms to enhance the efficiency and performance of our engineering systems and infrastructure. Our Applied AI team tackles some of the most challenging and fascinating issues in the industry, ensuring that Databricks infrastructure and products operate at peak performance and cost efficiency. This role is critical, as our customers depend on us to deliver the most optimized workloads. Your Impact: Develop comprehensive systems from the ground up within a dynamic team of seasoned professionals. Influence the direction of our applied machine learning investment areas by collaborating with engineering and product teams across the organization. Lead the design and implementation of advanced AI models and systems that enhance the capabilities and performance of Databricks' products, infrastructure, and services. Architect and deploy robust, scalable machine learning infrastructure, including data storage, processing, model training, serving components, and monitoring systems to facilitate seamless integration of AI/ML models into production environments. Explore innovative modeling techniques in the realm of machine learning for systems. Contribute to the wider AI community by publishing research, presenting at conferences, and actively engaging in open-source projects, thereby strengthening Databricks' reputation as an industry leader.

Jan 30, 2026

Apply

Senior Software Engineer - Infrastructure

Julius

Full-time|On-site|San Francisco, CA

Compensation: Competitive base salary + substantial equityBenefits: Health & dental insurance, gym reimbursement, daily team lunches, 401(K)About JuliusAt Julius, we're pioneering advancements in applied AI by developing cutting-edge coding agents. Our platform executes approximately 1 million lines of code every 36 hours, serving over 1 million users and generating 3 million+ visualizations. We manage all code in isolated remote containers. As a revenue-generating entity, we are backed by AI Grant and founders with remarkable backgrounds from companies like Vercel, Notion, Perplexity, Palantir, Replit, Zapier, Intercom, and Dropbox.The RoleJoin us in building and scaling the robust code-execution platform that powers Julius, across both cloud and on-prem environments. We orchestrate over 500,000 containers/month and the demand is growing rapidly. You will take ownership of reliability, performance, and security within our multi-tenant compute environment.Your ResponsibilitiesDesign and manage a secure, multi-tenant container infrastructure that ensures quick startup and intelligent autoscaling.Implement on-prem/private cloud deployments using Helm and Terraform, integrating SSO, network controls, and audit logging.Enhance observability (metrics, traces, logs) with well-defined SLOs and lead incident response initiatives.Optimize images, scheduling, networking, and costs, while developing fair-use and rate-limiting controls.Your QualificationsStrong experience with production Kubernetes and container internals (Docker/containerd); solid understanding of networking principles.Familiarity with cloud environments (AWS/GCP/Azure) and Infrastructure as Code (Terraform/Helm).Proficiency in monitoring and logging tools (Prometheus, Grafana, OpenTelemetry, ELK/Vector).Understanding of security best practices for containerized, multi-tenant systems.Preferred QualificationsExperience with gVisor, Kata, Firecracker; Cilium/eBPF; GPU scheduling; serverless autoscaling (KEDA/Knative/Karpenter).Proven experience delivering on-prem or air-gapped enterprise software solutions.A passion for AI, with experience building side projects involving LLMs.Why Join Julius?Be part of a small, senior team where your contributions will have a massive impact. Tackle challenging infrastructure problems at a meaningful scale.

Aug 11, 2025

Apply

AI Engineer — LLM Infrastructure

Yutori

Full-time|On-site|San Francisco, California, United States

At Yutori, we are transforming the way individuals engage with the digital realm by developing AI agents capable of efficiently performing everyday online tasks. Our approach is to create a comprehensive, agent-first ecosystem, encompassing everything from training proprietary models to designing innovative generative product interfaces.To further this mission, we are seeking a skilled AI Engineer to join our pioneering team. Ideal candidates should possess strong technical expertise and a passion for crafting superhuman AI agents that can navigate the web autonomously.Our founders — Devi Parikh, Abhishek Das, and Dhruv Batra — bring a wealth of experience in AI research and product development, particularly in generative, multimodal, and embodied AI, honed during their time at Meta. Our team merges AI proficiency with a design-oriented approach to advance Yutori’s objectives.Yutori is proudly supported by a distinguished group of visionary investors, including Elad Gil, Sarah Guo, Jeff Dean, Fei-Fei Li, Amjad Masad, Guillermo Rauch, Akshay Kothari, Soleio, Oliver Cameron, Julien Chaumond, Logan Kilpatrick, Bryan McCann, Vladlen Koltun, Jamie Cuffe, Michele Catasta, and many others.

Mar 26, 2025

Apply

Software Engineer, Frontier AI Infrastructure

Scale AI

Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.Your Responsibilities Include:Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.Owning services or systems while defining long-term health objectives and enhancing the health of related components.Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.

Mar 26, 2026

Apply

Senior Infrastructure Engineer

Mercury

Full-time|Remote|San Francisco, CA, New York, NY, Portland, OR, or Remote within Canada or United States

Join Mercury as a Senior Infrastructure Engineer, where you will be pivotal in shaping the infrastructure that supports our innovative financial solutions. You will work closely with cross-functional teams to design, implement, and maintain scalable and reliable infrastructure systems. This role is ideal for individuals who thrive in a fast-paced environment and are passionate about leveraging technology to drive business success.

Mar 19, 2026

Apply

Senior Backend/Infrastructure Engineer

AgentMail

Full-time|$250/yr - $400/yr|On-site|San Francisco

Location: San Francisco, CAType: Full-time, on-siteOverview: As the demand for AI agents surges, we are on a rapid hiring spree to meet this challenge. At AgentMail, we are innovating in a space where the solutions are still being discovered; we focus on creating systems that empower agents as essential components of the internet. If you thrive in building products from the ground up, we want to hear from you.AgentMail is developing an API that provides dedicated inboxes for AI agents, serving as their primary identity and communication hub. This isn't just AI-enhanced email; it's email tailored for AI.We are searching for a foundational engineer with robust backend and infrastructure skills to create the essential tools that agents will utilize for communication, authentication, and real-world operations. This role is perfect for someone who enjoys working with systems: APIs, distributed architecture, reliability, and developer tools.Why Join Us?Exceptional Team: Our team comprises industry experts with backgrounds in quantitative analysis, software engineering, and private equity, featuring talent from companies like Optiver, NVIDIA, Amazon, Modern Treasury, and KKR.Rapid Growth: Our user base has doubled every two weeks since January, and we anticipate this growth to continue. With agents soon surpassing human numbers, we expect significant market changes.Unique Position: We are the exclusive identity provider for AI agents, similar to how Gmail serves human users.Key ResponsibilitiesLead and mentor a team of 4-6 engineers.Design user-friendly APIs and abstractions for humans and AI agents alike.Develop high-performance, low-latency systems with AI agents as the end-users.Engage directly with our core infrastructure, managing the processing of hundreds of thousands of emails daily.Innovate at the intersection of advanced AI capabilities and traditional email systems.Oversee projects from inception through to production deployment.Produce clean, maintainable code along with comprehensive technical documentation.Actively gather and respond to customer feedback to drive improvements and refinements.

Mar 9, 2026

Apply

Senior Site Reliability Engineer for AI Infrastructure

Andromeda

Full-time|Remote|Global Remote / San Francisco, CA

Join Andromeda as a Senior Site Reliability Engineer specializing in AI Infrastructure. In this pivotal role, you will be responsible for ensuring the reliability, scalability, and performance of our cutting-edge AI systems. Collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our innovative AI initiatives. Your expertise will play a crucial role in maintaining optimal service availability and improving system performance.

Apr 9, 2026

Apply

Software Engineer - AI Infrastructure

Andromeda Cluster

Full-time|Remote|North America Remote / San Francisco, CA

Join Our Team as a Software Engineer - AI InfrastructureLocation: North America Remote / San Francisco · Full-TimeAt Andromeda Cluster, we are dedicated to democratizing access to advanced AI infrastructure that was once only available to hyperscalers. Founded by industry leaders Nat Friedman and Daniel Gross, we have evolved from a singular managed cluster to a global platform that connects top AI labs, data centers, and cloud providers around the world. Our orchestration layer efficiently manages training and inference tasks globally, enhancing flexibility and efficiency in this rapidly expanding sector. We aim to create a global marketplace for AI computing, empowering AGI with the same fluidity as global financial markets.As we continue to grow, we are on the lookout for talented individuals in the fields of AI infrastructure, research, and engineering.Your RoleIn the position of Infrastructure Product Engineer, you will be integral in constructing the foundational framework of Andromeda’s platform. Your challenge will be to simplify complex, real-world infrastructure issues into scalable product solutions that our customers will benefit from.Key ResponsibilitiesArchitect and develop essential platform components, focusing on infrastructure orchestration, provisioning, and lifecycle management solutions.Create robust APIs, services, and control planes that abstract diverse infrastructure types, including VMs, Kubernetes, bare metal, and schedulers.Convert customer usage patterns into actionable product requirements, delivering impactful features and enhancements.Design automation and internal tools to mitigate manual and ad-hoc operational tasks.Improve platform reliability, performance, and observability, focusing on sustainable enhancements rather than quick fixes.Collaborate with other teams to establish clear ownership boundaries between platform features and customer-specific solutions.Write clean, maintainable, and well-documented code with a focus on long-term sustainability.Engage in technical design discussions and contribute to the architectural advancements of our platform.

Feb 18, 2026

Create account — see all 8,545 results