About the job
At Magic, our mission is to create safe AGI that propels humanity forward in addressing the world’s most critical challenges. We believe that the key to achieving safe AGI lies in automating research and code generation to enhance models and resolve alignment issues more effectively than humans alone. Our unique approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time computation to realize this vision.
Role Overview
As a member of our Supercomputing Platform & Infrastructure team, you will design, build, and manage the large-scale GPU infrastructure that underpins Magic’s model training and inference.
A central part of this role is using Terraform-driven infrastructure as code to build and maintain our infrastructure, ensuring reproducibility, reliability, and operational clarity across clusters of thousands of GPUs.
Magic’s long-context models place sustained demands on compute, networking, and storage systems. The infrastructure must support long-running distributed jobs, high-throughput data movement, and stringent availability requirements, which calls for designs that are automated, observable, and resilient. You will own the systems and IaC foundations that make these capabilities possible.
This position can expand into broader responsibilities in supercomputing platform architecture, shaping how Magic scales GPU clusters and improves infrastructure reliability as model workloads grow.
Key Responsibilities
Design and manage large-scale GPU clusters for model training and inference.
Build and maintain infrastructure using Terraform across cloud and hybrid environments.
Develop modular, scalable IaC frameworks for provisioning compute, networking, and storage resources.
Improve deployment reproducibility, environment consistency, and operational safety.
Optimize networking and storage architectures for high-throughput AI workloads.
Automate fault detection and recovery mechanisms across distributed clusters.
Diagnose complex cross-layer issues involving hardware, drivers, networking, storage, operating systems, and cloud environments.
Improve observability, monitoring, and reliability of critical platform systems.
Qualifications
Strong foundation in systems engineering principles.
Extensive hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale implementations.