Senior Site Reliability Engineer at Heidi Health | San Francisco

Heidi Health, San Francisco

On-site Full-time

Experience Level

Senior

Qualifications

We are looking for candidates who have a strong background in site reliability engineering, particularly with experience in cloud technologies and Kubernetes management. Ideal candidates should also possess:
Proficiency in incident management and response.
Experience with automation tools and scripting languages.
A solid understanding of monitoring and observability principles.
Strong problem-solving skills and a proactive mindset.
Excellent communication and collaboration abilities.

About the job

About Us

At Heidi, we believe healthcare should have a more harmonious flow—one that prioritizes continuous and compassionate care. Our mission is to develop an AI Care Partner that collaborates with healthcare professionals to achieve this vision.

We are a diverse team of medical practitioners, engineers, designers, researchers, and visionaries dedicated to creating tools that allow clinicians to concentrate on what truly matters: their patients.

In just 18 months, Heidi has enabled healthcare professionals to reclaim over 18 million hours, facilitating 73 million patient visits across 116 countries. We currently support more than two million patient visits globally each week.

With nearly $100 million in funding, we are expanding our reach across the US, UK, Canada, and Europe, collaborating with top-tier health systems such as the NHS, Beth Israel Lahey Health, and Monash Health.

Your Role

Incident Response and On-Call Duties:
Take part in incident management, addressing production issues, aiding in service restoration, and ensuring effective communication throughout. As you gain experience, you'll lead incidents from start to finish.
Enhancing Operational Reliability:
Identify and address recurring issues and reliability threats, implementing improvements through enhanced alerting, automation, system modifications, or process enhancements.
Ownership of Production Environment:
Manage and enhance Kubernetes clusters, cloud infrastructure, and core platform services, gradually increasing your ownership as you become more familiar with our systems.
Observability Improvement:
Refine dashboards, alerts, logs, and traces to enable quicker issue detection and resolution, focusing on actionable insights.
Minimizing Operational Toil:
Automate routine tasks, streamline runbooks, and enhance tools to simplify on-call responsibilities and daily operations.
Facilitating Safe Changes:
Enhance deployment methods, rollback strategies, and operational readiness to mitigate the risks of incidents due to changes.
Contribution to Operational Practices:
Document and maintain runbooks, engage in blameless post-mortems, and assist in refining incident response protocols over time.
Collaboration with Engineering Teams:
Work closely with product and feature teams to ensure seamless integration and functionality.

About Heidi Health

Heidi Health is at the forefront of transforming healthcare by developing innovative AI solutions that enhance patient care and clinician efficiency. Our diverse team is passionate about creating an impact in the healthcare sector, and we are proud to partner with some of the leading health systems worldwide.

Similar jobs

Senior / Staff Site Reliability Engineer at Fluidstack | San Francisco, CA

Fluidstack

Full-time|$175K/yr - $320K/yr|On-site|San Francisco, CA

About Fluidstack

At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock computational power at unprecedented speeds.

Our mission is urgent: to turn Artificial General Intelligence (AGI) into a tangible reality. Our team is driven, dedicated to delivering top-tier infrastructure, and we treat the outcomes of our customers as if they were our own, taking immense pride in the systems we develop and the trust we establish. If you are purpose-driven, passionate about excellence, and ready to work diligently to propel the future of intelligence, we invite you to join us in shaping what comes next.

About the Role

As a Senior / Staff Site Reliability Engineer (SRE) at Fluidstack, you will be central to our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.

You will collaborate closely with teams in networking, platform engineering, and data center operations to construct systems that can scale to meet the increasing demands of AI workloads.

SREs at Fluidstack are hands-on experts with profound systems knowledge and excellent communication skills. Your responsibilities will include addressing complex production challenges, deploying robust infrastructure, and continuously enhancing the stability and observability of our platform as we expand.

A typical day might involve:
Deploying clusters of over 1,000 GPUs using custom playbooks and adjusting these tools to deliver optimal solutions for our clients.
Validating the correctness and performance of our compute, storage, and networking infrastructure, while collaborating with providers to enhance these subsystems.
Migrating petabytes of data from public cloud platforms to local storage, efficiently and cost-effectively.
Troubleshooting issues across the stack, ranging from hardware problems like obstructed server fans to optimizing S3 data loaders across different regions.
Creating internal tools to reduce deployment times and enhance cluster reliability, including automation where customer benefits clearly surpass implementation costs.

This role will require participation in an on-call rotation of up to one week per month.

Jan 14, 2026

Senior Site Reliability Engineer at Drata | San Francisco

Drata

Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.

Values

Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.

How the Team Works

Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center.

The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.

Growth and Culture

Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.

See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026

Senior Site Reliability Engineer at Carta | San Francisco, CA

Carta

Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA

Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.

Apr 3, 2026

Director of Infrastructure at Fluidstack | San Francisco

Fluidstack

Full-time|$250K/yr - $350K/yr|On-site|San Francisco, CA

About Fluidstack

At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and prominent enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we are committed to delivering computing capabilities at unparalleled speeds.

We are driven by a sense of urgency to realize Artificial General Intelligence (AGI). Our team is highly dedicated to providing world-class infrastructure, prioritizing our customers' success as our own. We take immense pride in the systems we create and the trust we cultivate. If your motivation stems from purpose, a relentless pursuit of excellence, and a readiness to work diligently to accelerate the future of intelligence, we invite you to join us in shaping what lies ahead.

About the Role

Fluidstack is looking for a Director of Infrastructure who will be responsible for the hardware that supports some of the largest AI clusters globally. You will lead a multidisciplinary team of Networking Engineers, Compute Systems Engineers, Storage Engineers, and the ICT team, working closely with Procurement, Data Center Operations, Software Engineering, Site Reliability Engineering, Finance, Security, and Sales to ensure Fluidstack delivers and operates clusters more swiftly and reliably than any competitor.

You have successfully deployed a GPU cluster with over 10,000 units using cutting-edge hardware. You possess the expertise to expedite deployment from months to weeks, having established the necessary tools, runbooks, and a culture that supports repeated success.

Mar 12, 2026

Senior Site Reliability Engineer at Hyperbolic | San Francisco

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Who We Are

At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.

In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.

About the Role

We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.

In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026

Software Engineer - Inference Platform at Fluidstack | San Francisco

Fluidstack

Full-time|$165K/yr - $500K/yr|On-site|San Francisco, CA

Join the Fluidstack Team

At Fluidstack, we’re pioneering the infrastructure for advanced intelligence. We collaborate with leading AI laboratories, governmental entities, and major corporations, including Mistral, Poolside, and Meta, to deliver computing solutions at unprecedented speeds.

Our mission is to transform the vision of Artificial General Intelligence (AGI) into a reality. Driven by our purpose, our dedicated team is committed to building state-of-the-art infrastructure that prioritizes our customers' success. If you share our passion for excellence and are eager to contribute to the future of intelligence, we invite you to be part of our journey.

Role Overview

The Inference Platform team at Fluidstack is at the forefront of addressing the cost and latency challenges associated with frontier AI. You will play a crucial role in managing the serving layer that connects our global accelerator supply with the production workloads of our clients, which include LLM serving frameworks, KV cache infrastructure, and Kubernetes orchestration across multiple data centers.

This hands-on individual contributor role combines elements of distributed systems, model optimization, and serving infrastructure. You will oversee the entire lifecycle of inference deployments for leading AI labs, striving for enhancements in throughput, cost-efficiency, and response times, while also influencing the architectural decisions that guide Fluidstack’s deployment strategies.

Mar 5, 2026

Office Manager at Fluidstack | San Francisco

Fluidstack

Full-time|$140K/yr - $205K/yr|On-site|San Francisco, CA

About Fluidstack

Fluidstack is at the forefront of building the infrastructure for advanced intelligence. We collaborate with leading AI labs, governmental bodies, and major enterprises like Mistral, Poolside, Black Forest Labs, and Meta to deliver computational power at unprecedented speeds.

Our mission is to accelerate the realization of Artificial General Intelligence (AGI). We are a dedicated team that prioritizes excellence and is passionate about creating world-class infrastructure. We view our customers' success as our own and take immense pride in the systems we develop and the trust we cultivate. If you are driven by a sense of purpose, have a relentless pursuit of excellence, and are eager to contribute to the future of intelligence, we invite you to join us in shaping what comes next.

About the People Team

The People team at Fluidstack is dedicated to fostering an environment where individuals can achieve their best work. We create and maintain the systems, environments, and partnerships that empower talented individuals to tackle significant challenges. Our work includes managing the infrastructure that supports the organization, enhancing employee experiences, and equipping managers and leaders to excel in their roles.

Why This Role Exists

As our office presence expands, it is crucial that daily operations are seamless and consistent. This role is essential to ensure that each Fluidstack office operates efficiently and meets high standards for employee satisfaction.

About the Role

Your influence over the workplace experience at Fluidstack is significant. You will be responsible for managing the daily operations of your office, including the aesthetic and functional aspects of the space, as well as cultivating an inviting atmosphere for employees and visitors alike. You will report directly to the Head of Workplace and work closely with the People Ops team to address any issues that arise, whether it's fixing a broken monitor or enhancing lunch options, with equal diligence.

This position demands a proactive approach and a strong sense of ownership. You will not wait for direction; instead, you will identify needs, take initiative, and implement solutions.

What You Will Do

First 30 Days
Complete onboarding to familiarize yourself with Fluidstack’s workplace standards, vendor landscape, and specific office dynamics.
Shadow the current office setup or audit operations if opening a new location.

Mar 18, 2026

Lead Business Process Controls at Fluidstack | San Francisco

Fluidstack

Full-time|$180K/yr - $250K/yr|On-site|San Francisco, CA

About Fluidstack

Fluidstack is revolutionizing the foundation of artificial intelligence by creating infrastructure that powers abundant intelligence. Collaborating with leading AI laboratories, government entities, and industry giants like Mistral, Poolside, and Meta, we are committed to facilitating compute capabilities at unprecedented speeds.

Our mission is to accelerate the realization of Artificial General Intelligence (AGI). We are driven by a strong sense of urgency and a commitment to providing world-class infrastructure. Our focus is on ensuring our customers achieve their desired outcomes, and we take pride in the systems we create and the trust we cultivate. If you are passionate about purpose, dedicated to excellence, and eager to contribute to shaping the future of intelligence, we invite you to join our team.

About the Role

As Fluidstack continues to expand rapidly to cater to the needs of the foremost AI organizations globally, we are seeking a Senior Manager for Business Process Controls to ensure operational excellence aligns with our ambitious technical goals. You will be responsible for designing, implementing, and continuously enhancing the vital processes that drive our business, including customer onboarding, vendor operations, internal workflows, and cross-functional collaboration. Collaborating closely with Engineering, Finance, Sales, and Operations teams, you will work to eliminate inefficiencies, enhance operational efficiency, and establish scalable systems that facilitate Fluidstack's growth without disruptions. This role is highly impactful and visible, ideal for an individual who excels at the intersection of strategic planning and execution.

Mar 9, 2026

Senior Site Reliability Engineer at prosper | San Francisco

prosper

Full-time|On-site|San Francisco, CA

Role overview

The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.

What you will do

Design and set up monitoring tools to track the health and performance of systems
Automate routine operational tasks to minimize manual intervention and boost efficiency
Diagnose and resolve complex technical problems that impact infrastructure or services
Support projects aimed at strengthening infrastructure stability and preparing for future growth

Location

This role is located in San Francisco, CA.

Apr 27, 2026

Senior Site Reliability Engineer at Plaud | San Francisco

Plaud Inc.

Full-time|On-site|San Francisco, CA

About Plaud Inc.

Plaud is revolutionizing the way professionals enhance productivity and performance with our trusted AI work companion. Our innovative note-taking solutions have gained the admiration of over 1,500,000 users globally since our inception in 2023. We are on a mission to amplify human intelligence by developing next-generation intelligence infrastructure and interfaces that seamlessly capture, extract, and leverage what you say, hear, see, and think.

Based in San Francisco, Plaud Inc. is a Delaware-incorporated company that is redefining the boundaries of human-AI collaboration through a unique combination of hardware and software solutions. We adhere to the highest standards of data security and privacy protection, with certifications including ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.

Discover more about our innovative solutions by visiting https://www.plaud.ai and follow us on Instagram, X, Facebook, LinkedIn, and YouTube.

Why You Should Join Us

At Plaud, you will play a pivotal role in shaping the future of human-AI interaction. Here’s what we offer:
A thriving, bootstrapped company with a remarkable $250M revenue run rate achieved in just three years.
An opportunity to define the next-generation paradigm for human-AI interaction.
Direct exposure to cutting-edge AI tools for professionals and a chance to contribute to our global expansion.
Collaborate with a passionate team that values innovation, teamwork, and customer success.
Advance your career in a culture that promotes continuous learning and rapid career growth.

Feb 24, 2026

Site Reliability Engineer at Mercor | San Francisco

Mercor

Full-time|On-site|San Francisco

Join the Mercor Team

At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.

Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students: by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.

We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.

Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.

About the Role

As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.

Your Responsibilities

Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.

Who We Seek

The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.

Dec 27, 2025

Site Reliability Engineer at Superhuman | San Francisco

Superhuman, Inc.

Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco

At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.

About Superhuman

Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.

The Opportunity

To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.

At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.

As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.

Jun 18, 2025

Systems Controls Lead at Fluidstack | San Francisco

Fluidstack

Full-time|$180K/yr - $250K/yr|On-site|San Francisco, CA

About Fluidstack

At Fluidstack, we are pioneering the infrastructure that fuels abundant intelligence. We collaborate with leading AI laboratories, government entities, and major corporations, including Mistral, Poolside, Black Forest Labs, and Meta, to facilitate computing capabilities at unparalleled speeds.

Our mission is to urgently transform the concept of Artificial General Intelligence (AGI) into reality. Our team is driven, passionate, and dedicated to constructing world-class infrastructure. We view our clients' success as our own, taking pride in the systems we create and the trust we establish. If you are inspired by purpose, dedicated to excellence, and eager to work diligently to propel the future of intelligence forward, we invite you to join us in shaping what lies ahead.

About the Role

As the Systems Controls Lead, you will take charge of designing, implementing, and continuously refining Fluidstack's General IT Controls (GITC) framework. You will work at the confluence of infrastructure, compliance, and security, ensuring that the systems that drive the future of AI are supported by a robust, auditable control environment. This role is critical and high-impact within a streamlined team, collaborating closely with Engineering, Security, Legal, and Finance to expand our controls program in line with business growth.

Mar 3, 2026

Site Reliability Engineer at Superhuman | San Francisco

Superhuman

Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA

At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.

About Superhuman

Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.

The Opportunity

In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model.

Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.

As a Site Reliability Engineer, your responsibilities will include:
Scaling our Kubernetes-based control plane, processing billions of events daily.
Enhancing our automation mechanisms in response to workload demands.
Deploying machine learning systems across the organization.

Mar 18, 2026

Senior Accountant at Fluidstack | San Francisco

Fluidstack

Full-time|$120K/yr - $140K/yr|On-site|San Francisco, CA

About Fluidstack

At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI labs, government entities, and major corporations such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to revolutionize computing capabilities at unprecedented speeds.

Our mission is driven by a sense of urgency to transform AGI from a concept into reality. Our team is passionate, dedicated, and committed to developing world-class infrastructure. We prioritize our customers' success as if it were our own, taking pride in the systems we create and the trust we build. If you are passionate about meaningful work, strive for excellence, and are eager to contribute to the future of intelligence, we invite you to join us in creating what comes next.

About the Role

As a Senior General Ledger Accountant at Fluidstack, you will play a crucial role in our expanding Finance team, ensuring the accuracy and integrity of our financial records as we experience significant growth. You will oversee the complete general ledger process, assist with month-end and year-end closing activities, and collaborate across various teams to guarantee precise and timely financial reporting. This high-impact position is ideal for individuals who thrive in a dynamic environment and are eager to establish top-tier accounting systems at one of the most innovative companies in the AI sector.

Mar 9, 2026

Site Reliability Engineer at Blaxel | San Francisco

Blaxel

Full-time|On-site|San Francisco

Join Our Team as a Site Reliability Engineer

Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.

In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities

Working closely with our founders, infrastructure team, and development team, and leveraging AI for maximum efficiency, you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are

Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

Mar 3, 2026

Product Manager for AI Infrastructure at Fluidstack | San Francisco

Fluidstack

Full-time|$180K/yr - $250K/yr|On-site|San Francisco, CA

About Fluidstack

At Fluidstack, we are revolutionizing the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, government entities, and major corporations such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to deliver computational power at unprecedented speeds.

We are diligently striving to realize Artificial General Intelligence (AGI). Our team is driven by a shared mission, emphasizing high-quality infrastructure that enhances our clients' success. We pride ourselves on our commitment to excellence and the trust we build with our customers. If you are passionate about making a meaningful impact and are dedicated to advancing the frontier of intelligence, we invite you to join us in shaping the future.

Role Overview

We are seeking a Product Manager to lead our AI platform roadmap, encompassing managed inference and agent platforms. You will be responsible for defining how Fluidstack empowers customers to deploy, scale, and optimize large language model (LLM) inference workloads, covering aspects from model serving and routing to agent orchestration and complex AI systems. This role involves balancing customer demands for low latency and high throughput with the practical considerations of GPU utilization, cost-effectiveness, and platform reliability. You will collaborate with engineering, machine learning research, and go-to-market teams to strategically position Fluidstack against inference-driven competitors such as Together AI, Fireworks, Baseten, Modal, and Replicate.

Key Responsibilities

Lead the product strategy and roadmap for managed inference services, focusing on model deployment, autoscaling, multi-LoRA serving, and inference optimization.
Define requirements for agent platform functionalities, including structured outputs, function calling, memory primitives, tool integration, and multi-step reasoning workflows.
Make informed decisions regarding prioritization of inference optimizations such as speculative decoding, continuous batching, KV cache management, quantization support, and custom kernel integration.
Collaborate with ML infrastructure engineers to create APIs, SDKs, and deployment workflows that facilitate model fine-tuning, version management, and A/B testing.
Partner with datacenter teams to enhance GPU allocation strategies, balancing dedicated versus serverless deployments, cold start latency, and cost-per-token economics.
Conduct competitive analysis of offerings from Together AI (inference optimization stack), Fireworks (custom inference engine), Baseten (training-to-inference integration), and Modal (serverless architecture).
Establish pricing models that reflect customer usage patterns (tokens, requests, GPU-hours) while ensuring platform sustainability.

Mar 3, 2026

Senior Site Reliability Engineer at Unify | San Francisco

Unify

Full-time|On-site|San Francisco Office

About Unify

At Unify, we're pioneering the first AI-driven system of action for revenue teams. Our innovative approach empowers companies to transform their outbound strategies into a leading growth engine, ensuring that go-to-market execution is observable, repeatable, and scalable. Established in 2023 by visionaries from Ramp and Scale AI, our diverse team boasts experience from industry giants such as Airbnb, Meta, Waymo, and Perplexity.

Having achieved an impressive 8x revenue growth in 2024, we proudly serve esteemed clients including Perplexity, Cursor, SoFi, and Justworks. With a dynamic team that has successfully raised $58M from prominent investors like Thrive, Emergence, and OpenAI, we are at the forefront of revolutionizing the future of GTM. Come and be a part of this exciting journey!

About the Role

As a Senior Site Reliability Engineer (SRE) at Unify, you will play a pivotal role in addressing the challenges of scaling and maintaining reliability as we handle immense data volumes and support enterprise clients with stringent uptime standards. Your expertise will span the entire tech stack: optimizing databases, fortifying services, and crafting automation and observability tools to ensure Unify remains fast and dependable at scale.

Jan 5, 2026

Senior Site Reliability Engineer at Heidi Health | San Francisco

Heidi Health

Full-time|On-site|San Francisco

About Us

At Heidi, we believe healthcare should have a more harmonious flow, one that prioritizes continuous and compassionate care. Our mission is to develop an AI Care Partner that collaborates with healthcare professionals to achieve this vision.

We are a diverse team of medical practitioners, engineers, designers, researchers, and visionaries dedicated to creating tools that allow clinicians to concentrate on what truly matters: their patients.

In just 18 months, Heidi has enabled healthcare professionals to reclaim over 18 million hours, facilitating 73 million patient visits across 116 countries. We currently support more than two million patient visits globally each week.

With nearly $100 million in funding, we are expanding our reach across the US, UK, Canada, and Europe, collaborating with top-tier health systems such as the NHS, Beth Israel Lahey Health, and Monash Health.

Your Role

Incident Response and On-Call Duties:
Take part in incident management, addressing production issues, aiding in service restoration, and ensuring effective communication throughout. As you gain experience, you'll lead incidents from start to finish.

Enhancing Operational Reliability:
Identify and address recurring issues and reliability threats, implementing improvements through enhanced alerting, automation, system modifications, or process enhancements.

Ownership of Production Environment:
Manage and enhance Kubernetes clusters, cloud infrastructure, and core platform services, gradually increasing your ownership as you become more familiar with our systems.

Observability Improvement:
Refine dashboards, alerts, logs, and traces to enable quicker issue detection and resolution, focusing on actionable insights.

Minimizing Operational Toil:
Automate routine tasks, streamline runbooks, and enhance tools to simplify on-call responsibilities and daily operations.

Facilitating Safe Changes:
Enhance deployment methods, rollback strategies, and operational readiness to mitigate the risks of incidents due to changes.

Contribution to Operational Practices:
Document and maintain runbooks, engage in blameless post-mortems, and assist in refining incident response protocols over time.

Collaboration with Engineering Teams:
Work closely with product and feature teams to ensure seamless integration and functionality.

Feb 26, 2026

Site Reliability Engineer at EngFlow | San Francisco

EngFlow

Full-time|On-site|San Francisco

Join Our Team at EngFlow

EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.

Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.

Discover more about our mission, culture, and team: EngFlow | Watch Our Video

We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.

Jan 27, 2026
