High-Performance Computing (HPC) Hardware Engineer

sfcomputeSan Francisco, CA

On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Experience Level

Experience

Qualifications

A minimum of 5 years of experience in operating, supporting, and scaling at least one HPC or GPU compute cluster in a production environment (preferably with more than 1,000 GPUs). Strong understanding of server hardware fundamentals and architecture.

About the job

At sfcompute, we are on a mission to revolutionize the infrastructure landscape by minimizing the risks associated with the largest build-outs in history.

When financing GPU clusters and the data centers that support them, having a contract in place—what we call an "offtake"—is crucial. This ensures that customers have signed on to lease the cluster even before it’s constructed.

The financing process for GPU clusters carries inherent risks due to thin margins and large volumes. Lenders often hesitate to take on the risk that developers may default on their loans, while developers are wary of being unable to sell their clusters. This dynamic leads to the necessity of transferring risk to customers via fixed-price, long-term contracts.

If customer risk isn't effectively mitigated, a market bubble can form. Unlike traditional SaaS models, application layer companies engage in multi-year contracts for compute and inference while offering customers monthly subscriptions. A miscalculation in purchasing can spell disaster; a small change in revenue growth could lead to profits or bankruptcy. Imagine a world where companies could exit their contracts by selling them back to the market.

As AI technology scales, compute power will increasingly only be available for those who can manage the associated risks. A small startup in a San Francisco Victorian house cannot feasibly commit to a 5-year, take-or-pay contract for $100 million supercomputers, but they might be able to purchase a month of liquidity that someone else has sold back.

That’s the market we’re building: a liquid marketplace for GPU offtake.

About the Role

As part of our infrastructure team, you will help design and deploy some of the most powerful GPU clusters in existence, with even smaller clusters today having ranked in the TOP500 five years ago. Your responsibilities will include participating in on-call rotations, deploying new environments, troubleshooting issues, and embracing automation to facilitate large-scale deployments. As a member of a small but dynamic team, you'll have the opportunity to significantly influence our company culture, mentor junior engineers, and engage directly with our customers.

About sfcompute

sfcompute is pioneering a transformative approach to infrastructure development, focusing on risk mitigation in the financing of GPU clusters and data centers. Our vision is to create accessible solutions for a diverse range of customers, enabling them to harness the power of high-performance computing without the associated financial burdens.

Similar jobs

1 - 20 of 6,832 Jobs

Search for Senior Hpc Gpu Infrastructure Engineer

6,832 results

Select all on this page (20)

Apply

Senior HPC & GPU Infrastructure Engineer

Sciforium

Full-time|On-site|San Francisco

At Sciforium, we are at the forefront of AI infrastructure, pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.About the RoleWe are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.Your Responsibilities1. System Health & Reliability (SRE)On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.2. Linux & Network AdministrationOS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.3. GPU & ML Stack EngineeringDeployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.

Jan 7, 2026

Apply

Software Engineer, GPU Infrastructure - HPC

OpenAI

Full-time|On-site|San Francisco

About Our TeamJoin the Fleet team at OpenAI, where we empower groundbreaking research and product innovation through our advanced computing infrastructure. We manage extensive systems across data centers, GPUs, and networking, ensuring optimal performance, high availability, and efficiency. Our work is crucial in enabling OpenAI’s models to function seamlessly at scale, supporting both our internal research endeavors and external products like ChatGPT. We are committed to prioritizing safety, reliability, and the ethical deployment of AI technology.About the RoleAs a Software Engineer on the Fleet High Performance Computing (HPC) team, you will play a vital role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for smooth research training progress and uninterrupted services, as even minor hardware issues can lead to significant setbacks. With the rise of large supercomputers, the stakes in maintaining efficiency and stability have never been higher.At the cutting edge of technology, we often lead the charge in troubleshooting complex, state-of-the-art systems at scale. This is a unique opportunity for you to engage with groundbreaking technologies and create innovative solutions that enhance the health and efficiency of our supercomputing infrastructure.Our team fosters a culture of autonomy and ownership, enabling skilled engineers to drive meaningful change. In this role, you will focus on comprehensive system investigations and develop automated solutions to enhance our operations. We seek individuals who dive deep into challenges, conduct thorough investigations, and create scalable automation for detection and remediation.Key Responsibilities:Develop and maintain automation systems for provisioning and managing server fleets.Create tools to monitor server health, performance metrics, and lifecycle events.Collaborate effectively with teams across clusters, networking, and infrastructure.Work closely with external operators to maintain a high level of service quality.Identify and resolve performance bottlenecks and inefficiencies in the system.Continuously enhance automation processes to minimize manual intervention.You Will Excel in This Role if You Have:Experience in managing large-scale server environments.A blend of technical skills in systems programming and infrastructure management.Strong problem-solving abilities and a methodical approach to troubleshooting.Familiarity with high-performance computing technologies and tools.

Feb 5, 2026

Apply

High-Performance Computing (HPC) Hardware Engineer

sfcompute

Full-time|On-site|San Francisco, CA

At sfcompute, we are on a mission to revolutionize the infrastructure landscape by minimizing the risks associated with the largest build-outs in history.When financing GPU clusters and the data centers that support them, having a contract in place—what we call an "offtake"—is crucial. This ensures that customers have signed on to lease the cluster even before it’s constructed.The financing process for GPU clusters carries inherent risks due to thin margins and large volumes. Lenders often hesitate to take on the risk that developers may default on their loans, while developers are wary of being unable to sell their clusters. This dynamic leads to the necessity of transferring risk to customers via fixed-price, long-term contracts.If customer risk isn't effectively mitigated, a market bubble can form. Unlike traditional SaaS models, application layer companies engage in multi-year contracts for compute and inference while offering customers monthly subscriptions. A miscalculation in purchasing can spell disaster; a small change in revenue growth could lead to profits or bankruptcy. Imagine a world where companies could exit their contracts by selling them back to the market.As AI technology scales, compute power will increasingly only be available for those who can manage the associated risks. A small startup in a San Francisco Victorian house cannot feasibly commit to a 5-year, take-or-pay contract for $100 million supercomputers, but they might be able to purchase a month of liquidity that someone else has sold back.That’s the market we’re building: a liquid marketplace for GPU offtake.About the RoleAs part of our infrastructure team, you will help design and deploy some of the most powerful GPU clusters in existence, with even smaller clusters today having ranked in the TOP500 five years ago. Your responsibilities will include participating in on-call rotations, deploying new environments, troubleshooting issues, and embracing automation to facilitate large-scale deployments. As a member of a small but dynamic team, you'll have the opportunity to significantly influence our company culture, mentor junior engineers, and engage directly with our customers.

Feb 25, 2026

Apply

Principal/Staff HPC Network Engineer

sfcompute

Full-time|On-site|San Francisco, CA

At sfcompute, we are pioneering a transformative approach to GPU cluster financing, enabling the largest infrastructure build-out in history while effectively mitigating risk.In the ever-evolving landscape of GPU technology, securing financing for GPU clusters and the essential infrastructure they require involves inherent risks. Our innovative model ensures that developers can lease clusters through fixed-price long-term contracts, thus offloading risk to the customer while maintaining financial stability.As AI and computational demands grow, our mission is to democratize access to powerful computing resources. We aim to create a liquid market for GPU offtake, allowing startups and smaller enterprises to thrive without the burden of long-term contracts that aren't feasible for them.Role OverviewJoin our dynamic infrastructure team, responsible for architecting and deploying cutting-edge GPU clusters globally. You'll play a crucial role in maintaining operational excellence, engaging in on-call rotations, and driving automation to facilitate large-scale deployments. As a key member of our small but ambitious team, you will help shape our culture, mentor junior engineers, and learn directly from our customers.

Feb 25, 2026

Apply

HPC Solutions Architect - Remote Opportunity

lavendo

Full-time|$225K/yr - $315K/yr|Remote|San Francisco

About UsAt Lavendo, we are pioneering an infrastructure that most engineers only dream of. We operate an AI-centric cloud platform that integrates expansive GPU clusters, high-speed networking, and cloud-native tools, catering to enterprises, innovative startups, and leading research teams. Our mission is straightforward: empower our clients to efficiently train and execute complex AI and simulation workloads without the need to construct their own supercomputers.As a publicly traded company, we are rapidly expanding, with R&D centers across North America, Europe, and the Middle East. Our culture emphasizes engineering excellence: minimal bureaucracy, significant ownership, and a focus on tackling challenging infrastructure problems while witnessing the impact of our work on real customer operations.Your Role as HPC Specialist Solutions ArchitectIn this pivotal role, you will be the go-to expert for customers looking to establish or enhance advanced GPU and HPC environments in the cloud. This includes multi-rack clusters, high-speed interconnects, intricate scheduling, and strict SLAs regarding throughput and latency.As an HPC Specialist Solutions Architect, you will design and optimize cutting-edge platforms for AI training, extensive simulations, and data-intensive workloads. You will collaborate closely with NVIDIA's latest hardware (Hopper, Blackwell, and future versions), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, having a substantial influence on the evolution of our platform and reference architectures. If you thrive on translating workloads into optimized clusters and maximizing performance, this is the ideal position for you.Your ResponsibilitiesCluster Design: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. Your considerations will include node types, GPU topology, queues, partitions, and failure scenarios.Infrastructure Optimization: Integrate NVIDIA Hopper and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, ensuring the hardware layout aligns with the communication patterns of the workloads.Automation: Deploy and manage GPU and Network Operators to standardize drivers, CUDA, firmware, and high-speed networking across extensive fleets, rather than managing on a box-by-box basis.Supercomputer Cloud Functionality: Design and validate cloud-native HPC environments that emulate supercomputer capabilities.

Feb 6, 2026

Apply

Senior AI Infrastructure Engineer

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Join Our MissionAt Hyperbolic Labs, we are dedicated to democratizing artificial intelligence by eliminating barriers to computing power through our Open-Access AI Cloud. We aggregate global computing resources to provide an innovative GPU marketplace and AI inference service, making AI affordable and accessible for everyone. As pioneers at the crossroads of AI and open-source technology, we envision a future where AI innovation is driven by imagination, not resource limitations. We invite forward-thinking individuals who share our vision of making AI universally accessible, secure, and cost-effective to join us in crafting a platform that empowers innovators to realize their groundbreaking AI projects.As we gear up for expansion following our Series A funding, our team, led by co-founders with PhDs in AI, Mathematics, and Computer Science, is set to transform the landscape of computing.The RoleWe are on the lookout for a Senior Infrastructure Engineer to drive the development and scaling of Hyperbolic's GPU Cloud Marketplace. In this pivotal role, you will create a multi-tenancy provisioning and virtualization solution that transforms raw GPUs from diverse global suppliers into a programmable, orchestrated resource pool serving thousands of AI developers and researchers. You will work at the forefront of cloud infrastructure, building the core orchestration layer that allows our platform to deliver cost savings of up to 75% compared to traditional cloud providers.

Mar 26, 2026

Apply

Product Manager, GPU Infrastructure NPI

Fluidstack

Full-time|$150K/yr - $250K/yr|On-site|San Francisco, CA

About FluidstackFluidstack is at the forefront of building groundbreaking infrastructure designed for the future of intelligence. We collaborate with premier AI research labs, government entities, and leading enterprises like Mistral, Poolside, Black Forest Labs, and Meta to deliver compute solutions at unparalleled speeds.Our mission is to expedite the realization of Artificial General Intelligence (AGI). Our team is dedicated, passionate, and driven to create world-class infrastructure, treating our clients' success as our own. If you possess a strong sense of purpose, a dedication to excellence, and the willingness to work diligently to transform the future of intelligence, we welcome you to join us in shaping what lies ahead.About the RoleWe are seeking a Product Manager to spearhead New Product Introduction (NPI) for our GPU infrastructure. You will collaborate with our datacenter, infrastructure, and networking teams to launch new GPU SKUs and compute solutions. Your role will involve defining the frameworks through which Fluidstack assesses, qualifies, and brings new GPU generations to market—from NVIDIA Blackwell and Rubin to AMD MI300X and future accelerators. This highly cross-functional position demands strong technical acumen, adept vendor relationship management, and a clear understanding of how hardware capabilities align with customer workload requirements. By doing so, you will help ensure that Fluidstack remains a leader in providing optimal compute options tailored for training, inference, and specialized AI workloads.Key ResponsibilitiesManage the NPI roadmap for GPU SKUs, including evaluation criteria, qualification timelines, and market strategies for new hardware generations.Collaborate with datacenter teams to establish requirements for power delivery (HVDC/LVDC), cooling systems (liquid vs. air), rack architecture, and the physical infrastructure necessary for next-gen GPUs.Engage with infrastructure engineers to validate hardware performance across essential metrics: training throughput (MFU), inference latency (TTFT, TBT), memory bandwidth, and interconnect topology (NVLink, InfiniBand).Foster vendor relationships with NVIDIA, AMD, and emerging XPU providers—conducting in-depth technical discussions, negotiating supply agreements, and overseeing early access programs.Define product specifications for system configurations: single-GPU instances, multi-GPU nodes, full rack deployments, and megacluster architectures.Analyze customer workload profiles to identify the optimal GPU mix: H100 for large model training, L40S for inference, B200 for frontier research, and MI300X for cost-sensitive workloads.Develop business cases for new SKU introductions.

Mar 3, 2026

Apply

Technical Intern - GPU Optimization and AI Infrastructure

Wafer

Internship|On-site|San Francisco

About the RoleWe invite you to join our innovative team at Wafer as a Technical Intern, where you will have the opportunity to shape the future of inference, GPU optimization, and AI infrastructure. As a full-time engineer, you will collaborate closely with our team to define our technical direction and develop the core systems that drive our GPU optimization platform.Your ResponsibilitiesDesign and implement scalable infrastructure for AI model training and inference.Make pivotal technical decisions and influence architectural choices.

Oct 15, 2025

Apply

Go-to-Market Champion for GPU & AI Infrastructure

Impossible Cloud

Full-time|Hybrid|On-site/ Hybrid / Remote

Group: Impossible Cloud / Impossible Cloud Network (ICN)Focus: Integrating Enterprise Storage with Decentralized GPU OrchestrationOur MissionAt Impossible Cloud, we are transforming enterprise storage through our patented decentralized object storage technology, delivering a high-performance, cost-effective infrastructure. We aim to expand this foundation by creating a next-generation AI-first platform that integrates storage, compute, and GPU functionalities.We are looking for a dynamic and hands-on Go-to-Market Champion specializing in AI and GPU Infrastructure to accelerate Impossible Cloud's position in the market for Agentic AI infrastructure. This is an exceptional opportunity to join a rapidly growing AI infrastructure company during a critical phase, owning the GTM strategy from development to scaling a successful sales organization.In this role, you will collaborate closely with founders, Product, Marketing, and Customer Success teams to transform our viral product into a reliable, scalable revenue machine for enterprises. Our culture thrives on relentless innovation, accountability, and ownership, where each team member is dedicated to excellence and urgency in their work.Key Responsibilities- Develop and execute Impossible Cloud’s global Go-to-Market (GTM) strategy, focusing on market segmentation, value propositions, pricing, and packaging for GPU cloud and AI infrastructure tailored to enterprises, startups, and research entities.- Create scalable customer acquisition and retention strategies through direct sales, channels, and partnerships, enhancing commercial enablement and managing the customer journey (both commercial and technical).- Build and lead a high-performing global GTM team encompassing presales, direct sales, partnerships, solutions engineering, marketing, and customer success, while developing playbooks and performance metrics to instill a culture of customer focus and excellence.- Work closely with Product and Engineering to align GTM strategies with the product roadmap, integrating direct customer insights, and gathering market intelligence to anticipate trends in AI and cloud technology adoption.- Identify, negotiate, and lead strategic partnerships with AI firms, ISVs, integrators, and cloud marketplaces, while engaging with Enterprise and AI Native clients as a trusted advisor.

Feb 26, 2026

Apply

GPU Performance Engineer

Genmo

Full-time|On-site|San Francisco HQ

At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.Key ResponsibilitiesUtilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.Develop high-performance CUDA and Triton kernels to optimize essential model functions.Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.Collaborate closely with machine learning engineers to optimize model implementations.Diagnose and resolve performance issues throughout the application and hardware stack.Implement custom memory pooling and allocation strategies to enhance performance.Promote performance optimization techniques and foster a culture of excellence across teams.

Jul 17, 2025

Apply

GPU Kernel Engineer

Sciforium

Full-time|On-site|San Francisco

At Sciforium, we are at the forefront of AI infrastructure, innovating next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration from AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.About the RoleWe are on the lookout for a talented GPU Kernel Engineer who is eager to explore and maximize performance on modern accelerators. In this role, you will be responsible for designing and optimizing custom GPU kernels that drive our advanced large-scale AI systems. You will navigate the hardware-software stack, engaging in low-level kernel development and integrating optimized operations into high-level machine learning frameworks for large-scale training and inference.This position is perfect for someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and aims to contribute significantly to the efficiency and scalability of our machine learning platform.Key ResponsibilitiesDevelop, implement, and enhance custom GPU kernels utilizing C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.Profile and fine-tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.Create performance models, pinpoint bottlenecks, and deliver kernel-level enhancements that significantly boost AI workloads.Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.Engage closely with hardware vendors (NVIDIA/AMD) and stay updated on the latest GPU architecture and compiler/toolchain advancements.Contribute to the development of tools, documentation, benchmarking suites, and testing frameworks ensuring correctness and performance reproducibility.Must-Haves5+ years of industry or research experience in GPU kernel development or high-performance computing.Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.Strong programming proficiency in C++, Python, and familiarity with machine learning frameworks.

Dec 6, 2025

Apply

GPU Kernel Engineer

Baseten

Full-time|On-site|San Francisco

ABOUT BASETENAt Baseten, we empower the world's leading AI firms—such as Cursor, Notion, and OpenEvidence—by delivering mission-critical inference solutions. Our unique blend of applied AI research, robust infrastructure, and user-friendly developer tools enables AI pioneers to effectively deploy groundbreaking models. With our recent achievement of a $300M Series E funding round supported by esteemed investors like BOND and IVP, we're on an exciting growth trajectory. Join our dynamic team and contribute to the platform that drives the next generation of AI products.THE ROLEWe are looking for an experienced Senior GPU Kernel Engineer to join our innovative team at the forefront of AI acceleration. In this role, your programming expertise will directly enhance the performance of cutting-edge machine learning models. You'll be responsible for developing highly efficient GPU kernels that optimize computational processes, allowing for transformative AI applications.You'll thrive in a fast-paced, intellectually challenging environment where your technical skills are pivotal. Your contributions will directly affect production systems that serve millions of users across various platforms. This position offers exceptional opportunities for career advancement for engineers enthusiastic about low-level optimization and impactful systems engineering.EXAMPLE INITIATIVESAs part of our Model Performance team, you will engage in projects like:Baseten Embeddings Inference: The quickest embeddings solution availableThe Baseten Inference StackEnhancing model performance optimizationRESPONSIBILITIESCore Engineering ResponsibilitiesDesign and develop high-performance GPU kernels for essential machine learning operations, including matrix multiplications and attention mechanisms.Collaborate with cross-functional teams to drive performance improvements and implement optimizations.Debug and refine kernel code to achieve maximal efficiency and reliability.Stay abreast of the latest advancements in GPU technology and machine learning frameworks.

Jul 17, 2025

Apply

Global GPU Commodity Manager

Andromeda Cluster

Full-time|Remote|Global Remote / San Francisco, CA

Location: North America Remote / San Francisco · Full-TimeAbout AndromedaFounded by Nat Friedman and Daniel Gross, Andromeda Cluster provides early-stage startups with access to scaled AI infrastructure, once exclusive to hyperscalers. Our journey began with a single managed cluster that rapidly gained demand, leading us to develop a robust system, network, and orchestration layer to democratize AI infrastructure.Today, we partner with leading AI labs, data centers, and cloud providers to efficiently deliver compute resources wherever needed. Our platform expertly routes training and inference jobs across global supply chains, promoting flexibility and efficiency in one of the fastest-growing markets in the world.Our vision is to create a liquidity layer for global AI compute, and we are on the lookout for bright minds in AI infrastructure, research, and engineering to join our expanding team.The OpportunityWe are seeking a dedicated Global GPU Commodity Manager to enhance the supply and demand matching on our platform. This role is an Individual Contributor position reporting to the Head of Infrastructure. The Infrastructure team is pivotal to our operations, responsible for acquiring and facilitating compute resources across the organization while collaborating closely with compute providers, sales, and technical teams to align supply with demand.With a solid foundation established with our providers, we are now scaling to expand our network and liquidity, broaden our service offerings, and accelerate our growth trajectory.What You'll DoMatch incoming leads from the sales team to internal and external market capacity.Maximize utilization of compute resources.Source and onboard new compute suppliers globally.Identify capacity based on customer requirements and market trends.Resolve customer and supplier challenges in a fast-paced environment.Analyze technical and commercial differences between suppliers to optimize our capacity funnel.Develop a proactive compute strategy driven by market intelligence.Negotiate costs with suppliers and other vendors.Create and implement processes around capacity planning.

Mar 25, 2026

Apply

Technical Staff Engineer - GPU Optimization at Wafer | San Francisco

Wafer

Full-time|On-site|San Francisco

About the PositionAt Wafer, we are on a mission to enhance the intelligence per watt by developing AI systems that can self-optimize. Our journey begins with GPU kernels, and we aim to revolutionize every aspect of ML systems and AI infrastructure. We are a compact, dynamic team of four, supported by renowned investors including Fifty Years, Y Combinator, Jeff Dean, and Woj Zaremba, co-founder of OpenAI. We are seeking passionate engineers eager to innovate at the convergence of AI agents and systems programming.In this role, you will collaborate closely with our founding team to create the systems that power our GPU optimization platform. Your projects will range from the agent framework that refines kernels to the profiling infrastructure that interfaces with NCU and ROCprofiler, as well as the compiler tools that scrutinize PTX and SASS.

Feb 4, 2026

Apply

Member of Technical Staff - GPU Infrastructure

Prime Intellect

Full-time|On-site|San Francisco

Join Our Mission to Build Open Superintelligence InfrastructureAt Prime Intellect, we are pioneering the development of an open superintelligence stack that encompasses cutting-edge agentic models and the infrastructure that empowers anyone to create, train, and deploy these advanced AI systems. Our innovative approach aggregates and orchestrates global computational resources into a cohesive control plane, complemented by a comprehensive reinforcement learning (RL) post-training toolkit that includes environments, secure sandboxes, verifiable evaluations, and our asynchronous RL trainer. We provide researchers, startups, and enterprises with the capabilities to execute end-to-end reinforcement learning at unparalleled scale, adapting models to real-world tools, workflows, and deployment scenarios.As a Solutions Architect for GPU Infrastructure, you will be the technical authority responsible for translating customer needs into robust, production-ready systems designed to train the world’s most sophisticated AI models.With a recent funding round raising $15 million (totaling $20 million) led by Founders Fund, alongside contributions from Menlo Ventures and illustrious angels such as Andrej Karpathy (Tesla, OpenAI), Tri Dao (Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Huggingface), and Emad Mostaque (Stability AI), we are poised for significant growth and innovation.Key Technical ResponsibilitiesThis role requires a blend of deep technical knowledge and hands-on implementation skills. Your contributions will be crucial in:Customer Architecture & DesignCollaborating with clients to comprehend workload specifications and architect optimal GPU cluster solutions.Drafting technical proposals and conducting capacity planning for clusters ranging from 100 to over 10,000 GPUs.Formulating deployment strategies for large language model (LLM) training, inference, and high-performance computing (HPC) tasks.Delivering architectural recommendations to both technical teams and executive stakeholders.Infrastructure Deployment & OptimizationImplementing and configuring orchestration frameworks such as SLURM and Kubernetes for distributed workloads.Establishing high-performance networking through InfiniBand, RoCE, and NVLink interconnects.Enhancing GPU utilization, memory management, and inter-node communication.Setting up parallel file systems (Lustre, BeeGFS, GPFS) to maximize I/O efficiency.Tuning system performance, from kernel parameters to CUDA configurations.Production Operations & SupportEnsuring the reliability and performance of GPU infrastructure through continuous monitoring and support.Collaborating with cross-functional teams to troubleshoot and optimize operational workflows.Documenting processes and creating training materials for team members and clients.

Aug 30, 2025

Apply

Technical Internship in AI and GPU Optimization

Wafer

Internship|On-site|San Francisco

About the RoleWe're excited to invite you to join wafer as a Spring Intern, where you will play a crucial role in shaping the future of AI infrastructure and GPU optimization. As part of our innovative team, you will work closely with full-time engineers to define our technical strategies and contribute to the development of the essential systems that drive our GPU optimization platform.Your ResponsibilitiesDesign and implement scalable infrastructure for AI model training and inference tasks.Guide the team in making technical decisions and architectural choices.Qualifications We SeekEssential Technical SkillsGPU Fundamentals: A strong grasp of GPU architectures, CUDA programming, and parallel computing methodologies.Deep Learning Frameworks: Skilled in PyTorch, TensorFlow, or JAX, especially for GPU-accelerated applications.Knowledge of LLM/AI: Solid foundation in large language models, including training, fine-tuning, prompting, and evaluation.Systems Engineering: Proficient in C++, Python, and potentially Rust/Go for developing tools around CUDA.Preferred BackgroundPublications or contributions to open-source projects related to inference GPU computing or ML/AI are advantageous.Hands-on experience in conducting large-scale experiments, benchmarking, and performance optimization.

Oct 15, 2025

Apply

ML Infrastructure Engineer

Sygaldry Technologies

Full-time|On-site|San Francisco

About Sygaldry Technologies Sygaldry Technologies develops quantum-accelerated AI servers in San Francisco, focusing on faster AI training and inference. By combining quantum technology with artificial intelligence, the team addresses challenges in computing costs and energy efficiency. Their AI servers integrate multiple qubit types within a fault-tolerant system, aiming for a balance of cost, scalability, and speed. The company values optimism, rigor, and a drive to solve complex problems in physics, engineering, and AI. Role Overview: ML Infrastructure Engineer The ML Infrastructure Engineer joins the AI & Algorithms team, which includes research scientists, applied mathematicians, and quantum algorithm specialists. This role centers on building and maintaining the compute infrastructure that powers advanced research. The systems you build will support reliable GPU access, reproducible experiments, and scalable workloads, so researchers can focus on their core work without needing deep cloud expertise. Expect to design and manage compute platforms for a range of tasks, including quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation. These workloads span multiple cloud providers and on-premises GPU servers. Key Responsibilities Develop compute abstractions for diverse workloads, such as GPU-accelerated simulations, distributed training, high-throughput CPU jobs, and interactive analyses using frameworks like PyTorch and JAX. Set up infrastructure to support experiment tracking and reproducibility. Create developer tools that make cloud computing feel local, streamlining environment setup, job submission, monitoring, and artifact management. Scale experiments from single-GPU prototypes to large, multi-node production runs. Multi-Cloud GPU Orchestration Design orchestration strategies for workloads across multiple cloud providers, optimizing job routing for cost, availability, and capability. Monitor and improve cloud spending, keeping track of credit balances, burn rates, and expiration dates.

Apr 14, 2026

Apply

Software Engineer, Fleet Infrastructure

OpenAI

Full-time|Hybrid|San Francisco

Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will range from:Developing user-friendly scheduling and quota systems to maximize GPU utilization.Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.Building service frameworks and deployment systems that support diverse research workflows.Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.And much more!About the RoleAs a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines.Positioned in San Francisco, CA, we embrace a hybrid work model, encouraging three days in the office each week, along with offering relocation assistance for new hires.In this role, you will:Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.Collaborate closely with research and product teams to understand and meet workload requirements effectively.Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.

Feb 13, 2025

Apply

Bioinformatics Software Engineer (GPU Accelerated)

Prima Mente

Full-time|On-site|San Francisco

Join Prima Mente: A Leader in Biology AIAt Prima Mente, we are redefining the frontier of biology through artificial intelligence. Our mission is to generate unique datasets, develop versatile biological foundation models, and translate groundbreaking discoveries into impactful research and clinical outcomes. With a commitment to understanding the complexities of the brain, we aim to shield it from neurological diseases while enhancing overall health. Our diverse team of AI researchers, experimentalists, clinicians, and operational experts operates across London, San Francisco, and Dubai.Role Overview: GPU/CPU-Accelerated BioinformaticsWe are seeking a skilled Bioinformatics Software Engineer to architect and implement scalable production pipelines for processing multi-omics data. The successful candidate will enable rapid transitions from hypothesis to patent-ready solutions in a matter of months.Key Responsibilities:Design and implement bioinformatics pipelines optimized for GPU/CPU utilization utilizing tools like Flyte and Nextflow, capable of processing over 1,000 samples at scale.Optimize performance and cost efficiency by leveraging GPU/CPU acceleration where it provides the greatest benefit.Collaborate with experimental and machine learning teams to validate computational results and align processing with model requirements.Foster and manage collaborations with academic and industrial research partners.Growth Expectations1 Month: You will be deploying your workflows on GPU/CPU-accelerated cloud infrastructure to process multi-omic experiments, while building relationships with AI/ML and wet lab teams to understand their requirements.3 Months: Your optimized pipelines will be processing thousands of samples with substantial speed enhancements and reduced costs, yielding publication and patent-ready outcomes.6 Months: Your automated pipelines will support daily AI model training, and you will co-design experiments alongside AI/ML engineers, leading technical execution on external collaborations.Your ProfileYou are passionate about pushing the boundaries of AI and biology. As an engineer rather than an analyst, you thrive on enhancing performance and efficiency while architecting robust systems. You are comfortable making rapid technical decisions and iterate quickly.Desired QualificationsExperience in bioinformatics, computational biology, or a related field.Proficiency in software engineering, particularly in developing scalable data processing pipelines.Strong understanding of multi-omics data and methods.Familiarity with GPU/CPU acceleration techniques.Excellent communication and collaboration skills.

Mar 2, 2026

Apply

Cloud Software Engineer

sfcompute

Full-time|On-site|San Francisco, CA

At sfcompute, we are pioneering a solution to mitigate the risks associated with one of the largest infrastructure build-outs in history.Our focus is on transforming how GPU clusters are financed. By creating a robust market for GPU offtake, we enable companies to secure contracts for leasing clusters even before they are constructed, significantly reducing financial risks involved.We recognize that financing GPU clusters can be perilous due to slim margins and massive volumes. Lenders are hesitant to take on the risks associated with cluster developers, while developers are equally wary of unsold clusters. Our innovative approach of utilizing fixed-price long-term contracts allows us to transfer this risk to customers effectively.As AI technology advances, only those who can shoulder the financial risk will have access to required computing power. We aim to democratize access by allowing smaller entities, such as startups, to engage in the GPU market without the burden of extensive long-term contracts.Join us in creating a dynamic and liquid market for GPU offtake that empowers businesses of all sizes.About the RoleWe are seeking a proactive Software Engineer to contribute to the development of our compute delivery platform that supports our innovative offtake machine. In this position, you will design and implement advanced systems that seamlessly connect our compute marketplace with the orchestration software for virtual machines operating on state-of-the-art HPC hardware.

Dec 12, 2025

Create account — see all 6,832 results