Experience Level
Senior
Qualifications
Key Responsibilities:
Investigate and resolve infrastructure issues reported by internal teams.
Provide technical guidance and support across multiple technical domains.
Contribute to runbooks, documentation, and knowledge sharing.
Mentor junior team members on SRE best practices and troubleshooting methodologies.
Identify and implement improvements to monitoring, alerting, and incident response processes.
Required Qualifications:
7+ years of experience in Site Reliability Engineering or equivalent systems administration.
Strong proficiency with Kubernetes and container orchestration.
Solid background in Linux/Unix systems administration.
Understanding of CI/CD processes and deployment strategies.
Familiarity with networking concepts.
Experience with infrastructure as code, troubleshooting, and general architecture.
Exceptional communication and documentation skills.
Preferred Technologies/Languages:
Kubernetes, Terraform, Golang, Python.
Experience working collaboratively in a cross-functional capacity.
Knowledge of compliance and change management processes.
About the job
Okta is seeking a Staff Site Reliability Engineer to join the Infrastructure Platform AGILE SRE team in San Francisco. This position centers on supporting and improving the systems that underpin Okta’s identity infrastructure.
Role overview
The Staff SRE will work closely with multiple teams to develop and maintain critical infrastructure. A core part of this role involves enhancing internal tools and operational processes, ensuring that Okta’s systems remain secure and reliable as the company grows.
What you will do
Provide cross-functional support to teams building and maintaining key infrastructure components.
Collaborate with Infrastructure Operations groups to address complex technical challenges.
Diagnose, troubleshoot, and resolve sophisticated infrastructure issues by developing new tools and strategic solutions.
Who we’re looking for
Experienced SREs who are comfortable working on large-scale, impactful projects.
Engineers who enjoy collaborating across teams and disciplines.
Problem-solvers who can tackle intricate technical challenges and deliver reliable solutions.
This role offers the chance to contribute directly to Okta’s mission of building secure, trusted infrastructure for organizations navigating the evolving landscape of AI and identity.
About Okta, Inc.
At Okta, we are committed to securing digital identities for individuals and organizations alike. Our innovative solutions provide the foundation for safe and seamless access to applications, ensuring that users can engage with technology confidently in an increasingly digital world.
Full-time|$162K/yr - $249K/yr|On-site|San Francisco, California
Full-time|$200.8K/yr - $250.9K/yr|On-site|San Francisco, California
About HeartFlow
HeartFlow, Inc. is a medical technology company focused on improving the diagnosis and management of coronary artery disease. Our flagship product, the AI-powered HeartFlow FFRCT Analysis, provides a non-invasive, color-coded 3D view of a patient’s coronary arteries. Clinicians use our platform to identify blockages, assess blood flow, and analyze atherosclerosis, all in alignment with ACC/AHA Chest Pain Guidelines. HeartFlow’s technology supports care teams in the US, UK, Europe, Japan, and Canada, and has already impacted over 500,000 patients worldwide. As a publicly traded company (NASDAQ: HTFL), HeartFlow continues to expand its product line and modernize its platform to support the next generation of life-saving medical technologies.
Role Overview: Staff/Lead Site Reliability Engineer (SRE)
HeartFlow is searching for an experienced Site Reliability Engineer to join the cloud-native infrastructure team in San Francisco, California. This role works closely with Platform engineers and development teams to maintain and improve the reliability, scalability, observability, and performance of critical systems.
What You Will Do
Collaborate with Platform and development teams to ensure system reliability and performance
Automate complex operational processes and reduce manual work
Establish and promote standards for production excellence
Support ongoing Platform Modernization initiatives
Who We’re Looking For
Extensive experience as a Site Reliability Engineer or in a similar role
Strong background in cloud-native infrastructure
Interest in automation, reliability, and scalable systems
Comfort working with cross-functional engineering teams
Location
This position is based in San Francisco, California.
ABOUT BASETEN
Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.
About Us
At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.
Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.
Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.
Your Role
In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.
Lead the development of Sierra’s observability stack (including monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.
Qualifications
Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
Full-time|On-site|North / Central / South America (in-person San Francisco Preference)
About ABC Labs:
Reserve is an innovative cryptocurrency project pioneering the asset-backed currency revolution. ABC Labs developed Reserve to empower individuals to launch, mint, and redeem on-chain crypto indexes known as Decentralized Token Folios (DTFs) using robust, safety-first smart contracts. Experience expansive crypto exposure, earn effortless DeFi yield, or help create the next world reserve currency. Currently, only 0.03% of crypto is in indexes, a number we anticipate will grow rapidly as the DTF space expands, and Reserve is at the forefront of this movement.
As we continue to tokenize real-world assets, we envision a protocol that facilitates new asset-backed currencies that are largely or fully independent of fiat currencies. Learn more about our vision and the Reserve team here.
ABC Labs plays a vital role in the development of the Reserve protocol, contributing to the growth and sustainability of the Reserve ecosystem.
Role Summary:
We are seeking a highly skilled engineer to join our expanding protocol engineering team. You will work with the Ethereum mainnet and its Layer 2 solutions, custom APIs, data pipelines, Docker, Cloudflare, metrics software, and other tools to build and maintain a scalable backend infrastructure. Your focus will be on ensuring our frontend UI, backend APIs, and developer operations maintain exceptional reliability and scalability. Users deserve an intuitive, seamless DeFi experience without compromising security or decentralization. The ideal candidate possesses full-stack experience, specializes as an SRE, and has a passion for scaling operations. Leadership capabilities to guide a small team toward achieving our goals will be essential. As a startup, team members often wear multiple hats, but being an outstanding SRE is your primary responsibility.
Our Tech Stack:
Bare metal servers
Linux (Ubuntu)
Docker
Redis
PostgreSQL
Cloudflare
TypeScript
Rust
Responsibilities:
Provision, configure, and secure Linux servers, preferably through automation
Manage blockchain nodes to ensure maximum uptime
Deploy and configure monitoring tools such as Prometheus & Grafana
Conduct load testing to identify and resolve bottlenecks in our API
Oversee fleets of Docker containers using Dokku, Swarm, or Kubernetes
Full-time|$170K/yr - $230K/yr|On-site|Palo Alto / San Francisco Bay Area
Mithril develops AI infrastructure aimed at making GPU computing more accessible and affordable for enterprises, AI startups, and researchers. Clients include LG AI Research, Saronic, and the Broad Institute. The company was founded by a former Google DeepMind research scientist and a Stanford CS PhD. Mithril has secured $80M in seed and Series A funding from Sequoia Capital and Lightspeed Venture Partners. Over the past year, platform revenue has grown more than sixfold. Fast Company recognized Mithril as the 8th Most Innovative Company in Artificial Intelligence for 2026. The engineering team at Mithril is small, with each member making a significant impact. This Site Reliability Engineer (SRE) position is a foundational role focused on shaping how the platform scales across a multi-cloud environment.
Role overview
This SRE will play a central role in keeping Mithril's global GPU orchestration platform stable and high-performing. The responsibilities extend beyond day-to-day maintenance. The primary focus is on designing and building automation, observability, and tooling to help manage advanced compute resources across multiple cloud providers. The goal is to ensure customers have fast and dependable access to infrastructure. Collaboration with Mithril's founding team is central to this job. The SRE will help set service level objectives (SLOs), orchestrate capacity, and make influential infrastructure decisions, gaining visibility into both technical and commercial aspects of the business.
What makes this SRE role unique
This position differs from many early-stage SRE roles that focus mainly on on-call rotations and incident response. Here, the emphasis is on building infrastructure that actively shapes Mithril's marketplace. The systems developed will determine how supply is sourced, allocated, and monitored across providers, directly affecting customer experience and company revenue. The role offers genuine ownership, a fast feedback loop with leadership, and the opportunity to define how infrastructure engineering evolves as Mithril grows.
Core responsibilities
About 70–75% of the work centers on platform reliability and infrastructure automation.
Reliability & SLOs
Implement and manage service level indicators (SLIs) and service level objectives (SLOs) for Mithril's API layer and internal orchestration services to maintain high reliability and performance.
Full-time|$194K/yr - $267K/yr|On-site|San Francisco, California
Discover Okta
Okta is recognized as The World’s Identity Company, empowering individuals to securely leverage any technology across various devices and applications. Our versatile Okta Platform and Auth0 Platform provide reliable access, authentication, and automation, placing identity at the forefront of business security and expansion.
At Okta, we value diverse perspectives and experiences. We seek continuous learners and individuals who can enhance our team with their distinct backgrounds.
Join us as we create a world where identity is truly yours.
We are in search of a highly skilled Observability Site Reliability Engineer specializing in Google Cloud, to take charge of and elevate our Observability ecosystem within GCP. In this position, you will progress beyond basic monitoring to develop a world-class, comprehensive, and scalable Observability Platform that supports our SRE teams and business collaborators. You will implement infrastructure as code by employing Terraform and demonstrating strong coding skills in Go, Python, or Ruby to automate the deployment of agents and collectors across intricate distributed systems.
Key Responsibilities
Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools such as Terraform.
GCP Observability Engineering: Enhance the collection, processing, and storage of Observability data to guarantee high reliability and low latency for our Splunk and Grafana services.
Incident Response: Engage in on-call rotations and conduct post-incident reviews to foster systemic improvements and promote 'observability-driven development.'
Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.
Why Join Harvey?
At Harvey, we're not just changing the landscape of legal and professional services; we're revolutionizing it from the ground up. By integrating cutting-edge AI technology with an enterprise-level platform and profound domain knowledge, we're setting new standards for how knowledge work is conducted for generations to come.
This is a unique opportunity to be a part of a transformative journey at a pivotal moment for our company. With over 1,000 clients in more than 58 countries, a robust product-market fit, and exceptional investor backing, we are rapidly scaling and defining a new industry standard. The challenges are ambitious, the expectations are high, and the potential for personal, professional, and financial growth is unparalleled.
Our team is composed of sharp, driven individuals who are deeply aligned with our mission. We operate with agility and intensity, taking ownership of the challenges we face, from initial brainstorming to long-term solutions. We engage closely with our clients, from executive leaders to engineers, collaborating to swiftly address real-world challenges with urgency and diligence. If you excel in uncertain environments, strive for excellence, and want to shape the future of work alongside high achievers, we encourage you to join our mission.
At Harvey, we are writing the future of professional services today, and we’re just getting started.
Role Overview
As a Staff Software Engineer on our Site Reliability Engineering (SRE) team, you will play a crucial role in ensuring the reliability, scalability, and performance of our legal AI platform. You'll be part of a dynamic team that bridges infrastructure and product, taking ownership of systems that guarantee our platform is fast, secure, and consistently operational. Your efforts will be pivotal in scaling our operations across over 50 regions and in automating essential operational tasks. If you are enthusiastic about creating resilient systems and simplifying processes through automation, we would love to have you on board.
This position is based in San Francisco, CA, and we follow an in-person work model, offering relocation assistance to new hires.
Full-time|$127K/yr - $249K/yr|Hybrid|United States
The Team
Join our dynamic Platform Engineering team within Site Reliability Engineering (SRE), which is tasked with maintaining vital infrastructure and operational functions that empower our engineering organization. We manage multi-cloud Kubernetes infrastructures, deployment systems, and observability frameworks.
The Fabric team specializes in ensuring secure communication between systems and the public internet. We focus on network architecture, service mesh, and edge load balancing, safeguarding customer data during transit. Our work is essential in building and sustaining a reliable, globally-connected multi-cloud network for MongoDB products.
This position is available in our New York City headquarters, smaller offices in Austin, Palo Alto, and San Francisco, or as a fully remote role from anywhere in North America. Our hybrid work model accommodates both in-office and remote work.
About the Role
Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.
This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.
Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students, by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco
At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.
Why Join Harvey?
At Harvey, we are revolutionizing the landscape of legal and professional services with a holistic approach. By integrating advanced AI technology, a robust enterprise platform, and extensive industry knowledge, we are redefining how essential knowledge work is conducted for years to come.
This is a unique opportunity to contribute to the foundation of a transformative company at a pivotal moment in its journey. With over 1000 clients across more than 58 countries, a solid product-market fit, and outstanding investor backing, we are rapidly expanding and creating a new category in real-time. The challenges are significant, expectations are high, and the potential for personal, professional, and financial development is unparalleled.
Our team comprises driven, intelligent individuals who are deeply passionate about our mission. We prioritize speed, intensity, and accountability in addressing challenges, from initial ideation to long-term solutions. By maintaining close relationships with our clients, from executives to engineers, we collaboratively address pressing issues with urgency and care. If you excel in uncertain environments, strive for excellence, and wish to shape the future of work alongside a team that raises the bar, we invite you to build alongside us.
At Harvey, we are currently writing the future of professional services, and we are just getting started.
Your Role
As a Senior Software Engineer on the Site Reliability team at Harvey, your mission will be to uphold the reliability, scalability, and performance of our innovative legal AI platform. You will become part of a high-impact team that operates at the crossroads of infrastructure and product, taking ownership of the systems that ensure our platform remains fast, secure, and continuously available. From scaling operations across 50+ regions to automating critical processes, your efforts will fortify Harvey's resilience as we expand. If you are enthusiastic about constructing robust systems and simplifying complexity through automation, we would love to collaborate with you.
This position is situated in San Francisco, CA, and we adhere to an in-person work model, providing relocation assistance to new employees.
Your Responsibilities
Design, implement, and oversee monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions.
Lead incident management processes, including postmortems, root cause analyses, and driving actionable enhancements.
Automate operational tasks and workflows by developing tools and processes for capacity planning, seamless rollouts, and secure data access to maintain high reliability and minimize manual intervention.
Collaborate across teams to drive solutions that enhance system performance and reliability.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Role overview
The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
Design and set up monitoring tools to track the health and performance of systems
Automate routine operational tasks to minimize manual intervention and boost efficiency
Diagnose and resolve complex technical problems that impact infrastructure or services
Support projects aimed at strengthening infrastructure stability and preparing for future growth
Location
This role is located in San Francisco, CA.
Join Our Team at EngFlow
EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.
Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.
Discover more about our mission, culture, and team: EngFlow | Watch Our Video
We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.
Join our innovative technology team at Twitter Inc. as a Senior Site Reliability Engineer. In this role, you will be pivotal in enhancing system reliability and performance, ensuring our services run smoothly and efficiently. We are seeking passionate engineers who thrive in a fast-paced environment and are eager to tackle challenging problems.
Full-time|$175K/yr - $320K/yr|On-site|San Francisco, CA
About Fluidstack
At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock computational power at unprecedented speeds.
Our mission is urgent: to turn Artificial General Intelligence (AGI) into a tangible reality. Our team is driven, dedicated to delivering top-tier infrastructure, and we treat the outcomes of our customers as if they were our own, taking immense pride in the systems we develop and the trust we establish. If you are purpose-driven, passionate about excellence, and ready to work diligently to propel the future of intelligence, we invite you to join us in shaping what comes next.
About the Role
As a Senior / Staff Site Reliability Engineer (SRE) at Fluidstack, you will be central to our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.
You will collaborate closely with teams in networking, platform engineering, and data center operations to construct systems that can scale to meet the increasing demands of AI workloads.
SREs at Fluidstack are hands-on experts with profound systems knowledge and excellent communication skills. Your responsibilities will include addressing complex production challenges, deploying robust infrastructure, and continuously enhancing the stability and observability of our platform as we expand.
A typical day might involve:
Deploying clusters of over 1,000 GPUs using custom playbooks and adjusting these tools to deliver optimal solutions for our clients.
Validating the correctness and performance of our compute, storage, and networking infrastructure, while collaborating with providers to enhance these subsystems.
Migrating petabytes of data from public cloud platforms to local storage, efficiently and cost-effectively.
Troubleshooting issues across the stack, ranging from hardware problems like obstructed server fans to optimizing S3 data loaders across different regions.
Creating internal tools to reduce deployment times and enhance cluster reliability, including automation where customer benefits clearly surpass implementation costs.
This role will require participation in an on-call rotation of up to one week per month.
Site Reliability Engineer
Location: San Francisco, CA (5 Days In-Office)
As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.
What Makes a Great Engineer at Latent
We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.
- Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
- Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
- Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
- Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.
Your Responsibilities
As our SRE, you will take full ownership of the production environment and enhance the developer experience:
- Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
- Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
- CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
- DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
- Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.
Full-time|$162K/yr - $249K/yr|On-site|San Francisco, California
Okta is seeking a Staff Site Reliability Engineer to join the Infrastructure Platform AGILE SRE team in San Francisco. This position centers on supporting and improving the systems that underpin Okta’s identity infrastructure.
Role overview
The Staff SRE will work closely with multiple teams to develop and maintain critical infrastructure. A core part of this role involves enhancing internal tools and operational processes, ensuring that Okta’s systems remain secure and reliable as the company grows.
What you will do
- Provide cross-functional support to teams building and maintaining key infrastructure components.
- Collaborate with Infrastructure Operations groups to address complex technical challenges.
- Diagnose, troubleshoot, and resolve sophisticated infrastructure issues by developing new tools and strategic solutions.
Who we’re looking for
- Experienced SREs who are comfortable working on large-scale, impactful projects.
- Engineers who enjoy collaborating across teams and disciplines.
- Problem-solvers who can tackle intricate technical challenges and deliver reliable solutions.
This role offers the chance to contribute directly to Okta’s mission of building secure, trusted infrastructure for organizations navigating the evolving landscape of AI and identity.
Full-time|$200.8K/yr - $250.9K/yr|On-site|San Francisco, California
About HeartFlow
HeartFlow, Inc. is a medical technology company focused on improving the diagnosis and management of coronary artery disease. Our flagship product, the AI-powered HeartFlow FFRCT Analysis, provides a non-invasive, color-coded 3D view of a patient’s coronary arteries. Clinicians use our platform to identify blockages, assess blood flow, and analyze atherosclerosis, all in alignment with ACC/AHA Chest Pain Guidelines. HeartFlow’s technology supports care teams in the US, UK, Europe, Japan, and Canada, and has already impacted over 500,000 patients worldwide. As a publicly traded company (NASDAQ: HTFL), HeartFlow continues to expand its product line and modernize its platform to support the next generation of life-saving medical technologies.
Role Overview: Staff/Lead Site Reliability Engineer (SRE)
HeartFlow is searching for an experienced Site Reliability Engineer to join the cloud-native infrastructure team in San Francisco, California. This role works closely with Platform engineers and development teams to maintain and improve the reliability, scalability, observability, and performance of critical systems.
What You Will Do
- Collaborate with Platform and development teams to ensure system reliability and performance
- Automate complex operational processes and reduce manual work
- Establish and promote standards for production excellence
- Support ongoing Platform Modernization initiatives
Who We’re Looking For
- Extensive experience as a Site Reliability Engineer or in a similar role
- Strong background in cloud-native infrastructure
- Interest in automation, reliability, and scalable systems
- Comfort working with cross-functional engineering teams
Location
This position is based in San Francisco, California.
ABOUT BASETEN
Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
- Innovative multi-cloud capacity management
- Optimizing inference on B200 GPUs
- Implementing multi-node inference
- Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
- Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
- Establish standards and best practices to enhance reliability and performance across the infrastructure.
- Proactively identify and resolve reliability issues using monitoring and alerting systems.
- Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
- Create automation scripts to streamline processes and reduce manual intervention.
About Us
At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.
Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.
Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.
Your Role
In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.
- Lead the development of Sierra’s observability stack (including monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
- Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
- Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
- Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
- Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
- Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.
Qualifications
- Bachelor's degree in Computer Science or a related field, or equivalent experience.
- Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
- Proficiency in Terraform and modern DevOps practices.
- Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
- Strong problem-solving skills with a focus on scalability and performance optimization.
- Excellent collaboration and communication skills, with the ability to work effectively in a team environment.
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
Full-time|On-site|North / Central / South America (in-person San Francisco Preference)
About ABC Labs:
Reserve is an innovative cryptocurrency project pioneering the asset-backed currency revolution. ABC Labs developed Reserve to empower individuals to launch, mint, and redeem on-chain crypto indexes known as Decentralized Token Folios (DTFs) using robust, safety-first smart contracts. Experience expansive crypto exposure, earn effortless DeFi yield, or help create the next world reserve currency. Currently, only 0.03% of crypto is in indexes, a number we anticipate will grow rapidly as the DTF space expands, and Reserve is at the forefront of this movement.
As we continue to tokenize real-world assets, we envision a protocol that facilitates new asset-backed currencies that are largely or fully independent of fiat currencies. Learn more about our vision and the Reserve team here.
ABC Labs plays a vital role in the development of the Reserve protocol, contributing to the growth and sustainability of the Reserve ecosystem.
Role Summary:
We are seeking a highly skilled engineer to join our expanding protocol engineering team. You will work with the Ethereum mainnet and its Layer 2 solutions, custom APIs, data pipelines, Docker, Cloudflare, metrics software, and other tools to build and maintain a scalable backend infrastructure. Your focus will be on ensuring our frontend UI, backend APIs, and developer operations maintain exceptional reliability and scalability. Users deserve an intuitive, seamless DeFi experience without compromising security or decentralization. The ideal candidate possesses full-stack experience, specializes as an SRE, and has a passion for scaling operations. Leadership capabilities to guide a small team toward achieving our goals will be essential. As a startup, team members often wear multiple hats, but being an outstanding SRE is your primary responsibility.
Our Tech Stack:
- Bare metal servers
- Linux (Ubuntu)
- Docker
- Redis
- PostgreSQL
- Cloudflare
- TypeScript
- Rust
Responsibilities:
- Provision, configure, and secure Linux servers, preferably through automation
- Manage blockchain nodes to ensure maximum uptime
- Deploy and configure monitoring tools such as Prometheus & Grafana
- Conduct load testing to identify and resolve bottlenecks in our API
- Oversee fleets of Docker containers using Dokku, Swarm, or Kubernetes
Full-time|$170K/yr - $230K/yr|On-site|Palo Alto / San Francisco Bay Area
Mithril develops AI infrastructure aimed at making GPU computing more accessible and affordable for enterprises, AI startups, and researchers. Clients include LG AI Research, Saronic, and the Broad Institute. The company was founded by a former Google DeepMind research scientist and a Stanford CS PhD. Mithril has secured $80M in seed and Series A funding from Sequoia Capital and Lightspeed Venture Partners. Over the past year, platform revenue has grown more than sixfold. Fast Company recognized Mithril as the 8th Most Innovative Company in Artificial Intelligence for 2026. The engineering team at Mithril is small, with each member making a significant impact. This Site Reliability Engineer (SRE) position is a foundational role focused on shaping how the platform scales across a multi-cloud environment.
Role overview
This SRE will play a central role in keeping Mithril's global GPU orchestration platform stable and high-performing. The responsibilities extend beyond day-to-day maintenance. The primary focus is on designing and building automation, observability, and tooling to help manage advanced compute resources across multiple cloud providers. The goal is to ensure customers have fast and dependable access to infrastructure. Collaboration with Mithril's founding team is central to this job. The SRE will help set service level objectives (SLOs), orchestrate capacity, and make influential infrastructure decisions, gaining visibility into both technical and commercial aspects of the business.
What makes this SRE role unique
This position differs from many early-stage SRE roles that focus mainly on on-call rotations and incident response. Here, the emphasis is on building infrastructure that actively shapes Mithril's marketplace. The systems developed will determine how supply is sourced, allocated, and monitored across providers, directly affecting customer experience and company revenue. The role offers genuine ownership, a fast feedback loop with leadership, and the opportunity to define how infrastructure engineering evolves as Mithril grows.
Core responsibilities
About 70–75% of the work centers on platform reliability and infrastructure automation.
Reliability & SLOs
Implement and manage service level indicators (SLIs) and service level objectives (SLOs) for Mithril's API layer and internal orchestration services to maintain high reliability and performance.
Full-time|$194K/yr - $267K/yr|On-site|San Francisco, California
Discover Okta
Okta is recognized as The World’s Identity Company, empowering individuals to securely leverage any technology across various devices and applications. Our versatile Okta Platform and Auth0 Platform provide reliable access, authentication, and automation, placing identity at the forefront of business security and expansion.
At Okta, we value diverse perspectives and experiences. We seek continuous learners and individuals who can enhance our team with their distinct backgrounds.
Join us as we create a world where identity is truly yours.
We are in search of a highly skilled Observability Site Reliability Engineer specializing in Google Cloud, to take charge of and elevate our Observability ecosystem within GCP. In this position, you will progress beyond basic monitoring to develop a world-class, comprehensive, and scalable Observability Platform that supports our SRE teams and business collaborators. You will implement infrastructure as code by employing Terraform and demonstrating strong coding skills in Go, Python, or Ruby to automate the deployment of agents and collectors across intricate distributed systems.
Key Responsibilities
- Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools such as Terraform.
- GCP Observability Engineering: Enhance the collection, processing, and storage of Observability data to guarantee high reliability and low latency for our Splunk and Grafana services.
- Incident Response: Engage in on-call rotations and conduct post-incident reviews to foster systemic improvements and promote 'observability-driven development.'
- Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.
Why Join Harvey?
At Harvey, we're not just changing the landscape of legal and professional services; we're revolutionizing it from the ground up. By integrating cutting-edge AI technology with an enterprise-level platform and profound domain knowledge, we're setting new standards for how knowledge work is conducted for generations to come.
This is a unique opportunity to be a part of a transformative journey at a pivotal moment for our company. With over 1,000 clients in more than 58 countries, a robust product-market fit, and exceptional investor backing, we are rapidly scaling and defining a new industry standard. The challenges are ambitious, the expectations are high, and the potential for personal, professional, and financial growth is unparalleled.
Our team is composed of sharp, driven individuals who are deeply aligned with our mission. We operate with agility and intensity, taking ownership of the challenges we face, from initial brainstorming to long-term solutions. We engage closely with our clients, from executive leaders to engineers, collaborating to swiftly address real-world challenges with urgency and diligence. If you excel in uncertain environments, strive for excellence, and want to shape the future of work alongside high achievers, we encourage you to join our mission.
At Harvey, we are writing the future of professional services today, and we’re just getting started.
Role Overview
As a Staff Software Engineer on our Site Reliability Engineering (SRE) team, you will play a crucial role in ensuring the reliability, scalability, and performance of our legal AI platform. You'll be part of a dynamic team that bridges infrastructure and product, taking ownership of systems that guarantee our platform is fast, secure, and consistently operational. Your efforts will be pivotal in scaling our operations across over 50 regions and in automating essential operational tasks. If you are enthusiastic about creating resilient systems and simplifying processes through automation, we would love to have you on board.
This position is based in San Francisco, CA, and we follow an in-person work model, offering relocation assistance to new hires.
Full-time|$127K/yr - $249K/yr|Hybrid|United States
The Team
Join our dynamic Platform Engineering team within Site Reliability Engineering (SRE), which is tasked with maintaining vital infrastructure and operational functions that empower our engineering organization. We manage multi-cloud Kubernetes infrastructures, deployment systems, and observability frameworks.
The Fabric team specializes in ensuring secure communication between systems and the public internet. We focus on network architecture, service mesh, and edge load balancing, safeguarding customer data during transit. Our work is essential in building and sustaining a reliable, globally-connected multi-cloud network for MongoDB products.
This position is available in our New York City headquarters, smaller offices in Austin, Palo Alto, and San Francisco, or as a fully remote role from anywhere in North America. Our hybrid work model accommodates both in-office and remote work.
About the Role
Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.
This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.
Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students: by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
- Ensure the reliability and safety of production for key shared services and customer-facing systems.
- Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
- Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
- Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
- Work alongside engineering and applied AI teams to facilitate sustainable growth.
- Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
- Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
- A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
- 5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
- A proven track record of managing systems at scale, with a strong grasp of the complexities involved.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco
At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
- Scaling our Kubernetes-based control plane that processes billions of events each day.
- Enhancing our automation mechanisms to efficiently respond to workload demands.
- Deploying machine learning systems across various departments.
Why Join Harvey?
At Harvey, we are revolutionizing the landscape of legal and professional services with a holistic approach. By integrating advanced AI technology, a robust enterprise platform, and extensive industry knowledge, we are redefining how essential knowledge work is conducted for years to come.
This is a unique opportunity to contribute to the foundation of a transformative company at a pivotal moment in its journey. With over 1000 clients across more than 58 countries, a solid product-market fit, and outstanding investor backing, we are rapidly expanding and creating a new category in real-time. The challenges are significant, expectations are high, and the potential for personal, professional, and financial development is unparalleled.
Our team comprises driven, intelligent individuals who are deeply passionate about our mission. We prioritize speed, intensity, and accountability in addressing challenges, from initial ideation to long-term solutions. By maintaining close relationships with our clients, from executives to engineers, we collaboratively address pressing issues with urgency and care. If you excel in uncertain environments, strive for excellence, and wish to shape the future of work alongside a team that raises the bar, we invite you to build alongside us.
At Harvey, we are currently writing the future of professional services, and we are just getting started.
Your Role
As a Senior Software Engineer on the Site Reliability team at Harvey, your mission will be to uphold the reliability, scalability, and performance of our innovative legal AI platform. You will become part of a high-impact team that operates at the crossroads of infrastructure and product, taking ownership of the systems that ensure our platform remains fast, secure, and continuously available. From scaling operations across 50+ regions to automating critical processes, your efforts will fortify Harvey's resilience as we expand. If you are enthusiastic about constructing robust systems and simplifying complexity through automation, we would love to collaborate with you.
This position is situated in San Francisco, CA, and we adhere to an in-person work model, providing relocation assistance to new employees.
Your Responsibilities
- Design, implement, and oversee monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions.
- Lead incident management processes, including postmortems, root cause analyses, and driving actionable enhancements.
- Automate operational tasks and workflows by developing tools and processes for capacity planning, seamless rollouts, and secure data access to maintain high reliability and minimize manual intervention.
- Collaborate across teams to drive solutions that enhance system performance and reliability.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Role overview
The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
- Design and set up monitoring tools to track the health and performance of systems
- Automate routine operational tasks to minimize manual intervention and boost efficiency
- Diagnose and resolve complex technical problems that impact infrastructure or services
- Support projects aimed at strengthening infrastructure stability and preparing for future growth
Location
This role is located in San Francisco, CA.
Join Our Team at EngFlowEngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.Discover more about our mission, culture, and team: EngFlow | Watch Our VideoWe are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.
Join our innovative technology team at Twitter Inc. as a Senior Site Reliability Engineer. In this role, you will be pivotal in enhancing system reliability and performance, ensuring our services run smoothly and efficiently. We are seeking passionate engineers who thrive in a fast-paced environment and are eager to tackle challenging problems.
Full-time|$175K/yr - $320K/yr|On-site|San Francisco, CA
About Fluidstack
At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock computational power at unprecedented speeds.
Our mission is urgent: to turn Artificial General Intelligence (AGI) into a tangible reality. Our team is driven, dedicated to delivering top-tier infrastructure, and we treat the outcomes of our customers as if they were our own, taking immense pride in the systems we develop and the trust we establish. If you are purpose-driven, passionate about excellence, and ready to work diligently to propel the future of intelligence, we invite you to join us in shaping what comes next.
About the Role
As a Senior / Staff Site Reliability Engineer (SRE) at Fluidstack, you will be central to our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.
You will collaborate closely with teams in networking, platform engineering, and data center operations to construct systems that can scale to meet the increasing demands of AI workloads.
SREs at Fluidstack are hands-on experts with profound systems knowledge and excellent communication skills. Your responsibilities will include addressing complex production challenges, deploying robust infrastructure, and continuously enhancing the stability and observability of our platform as we expand.
A typical day might involve:
Deploying clusters of over 1,000 GPUs using custom playbooks and adjusting these tools to deliver optimal solutions for our clients.
Validating the correctness and performance of our compute, storage, and networking infrastructure, while collaborating with providers to enhance these subsystems.
Migrating petabytes of data from public cloud platforms to local storage, efficiently and cost-effectively.
Troubleshooting issues across the stack, ranging from hardware problems like obstructed server fans to optimizing S3 data loaders across different regions.
Creating internal tools to reduce deployment times and enhance cluster reliability, including automation where customer benefits clearly surpass implementation costs.
This role will require participation in an on-call rotation of up to one week per month.
Site Reliability Engineer
Location: San Francisco, CA (5 Days In-Office)
As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.
What Makes a Great Engineer at Latent
We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.
Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.
Your Responsibilities
As our SRE, you will take full ownership of the production environment and enhance the developer experience:
Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.
Dec 5, 2025