Experience Level
Mid to Senior
Qualifications
Proven experience in site reliability engineering or related fields. Strong understanding of cloud services, particularly AWS or Azure. Experience with automation tools and scripting languages. Excellent problem-solving skills and ability to work under pressure. Ability to collaborate with cross-functional teams effectively.
About the job
Join our innovative technology team at Twitter Inc. as a Senior Site Reliability Engineer. In this role, you will be pivotal in enhancing system reliability and performance, ensuring our services run smoothly and efficiently. We are seeking passionate engineers who thrive in a fast-paced environment and are eager to tackle challenging problems.
About Twitter Inc.
Twitter Inc. is a leading social media platform that connects millions of people worldwide. Our mission is to give everyone the power to create and share ideas instantly, without barriers. We foster an inclusive workplace that encourages innovation and values the contributions of all employees.
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Full-time|On-site|North / Central / South America (in-person San Francisco Preference)
About ABC Labs:
Reserve is an innovative cryptocurrency project pioneering the asset-backed currency revolution. ABC Labs developed Reserve to empower individuals to launch, mint, and redeem on-chain crypto indexes known as Decentralized Token Folios (DTFs) using robust, safety-first smart contracts. Experience expansive crypto exposure, earn effortless DeFi yield, or help create the next world reserve currency. Currently, only 0.03% of crypto is in indexes, a number we anticipate will grow rapidly as the DTF space expands, and Reserve is at the forefront of this movement.
As we continue to tokenize real-world assets, we envision a protocol that facilitates new asset-backed currencies that are largely or fully independent of fiat currencies. Learn more about our vision and the Reserve team here.
ABC Labs plays a vital role in the development of the Reserve protocol, contributing to the growth and sustainability of the Reserve ecosystem.
Role Summary:
We are seeking a highly skilled engineer to join our expanding protocol engineering team. You will work with the Ethereum mainnet and its Layer 2 solutions, custom APIs, data pipelines, Docker, Cloudflare, metrics software, and other tools to build and maintain a scalable backend infrastructure. Your focus will be on ensuring our frontend UI, backend APIs, and developer operations maintain exceptional reliability and scalability. Users deserve an intuitive, seamless DeFi experience without compromising security or decentralization. The ideal candidate possesses full-stack experience, specializes as an SRE, and has a passion for scaling operations. Leadership capabilities to guide a small team toward achieving our goals will be essential. As a startup, team members often wear multiple hats, but being an outstanding SRE is your primary responsibility.
Our Tech Stack:
Bare metal servers, Linux (Ubuntu), Docker, Redis, PostgreSQL, Cloudflare, TypeScript, Rust
Responsibilities:
Provision, configure, and secure Linux servers, preferably through automation
Manage blockchain nodes to ensure maximum uptime
Deploy and configure monitoring tools such as Prometheus & Grafana
Conduct load testing to identify and resolve bottlenecks in our API
Oversee fleets of Docker containers using Dokku, Swarm, or Kubernetes
Full-time|$194K/yr - $267K/yr|On-site|San Francisco, California
Discover Okta
Okta is recognized as The World’s Identity Company, empowering individuals to securely leverage any technology across various devices and applications. Our versatile Okta Platform and Auth0 Platform provide reliable access, authentication, and automation, placing identity at the forefront of business security and expansion.
At Okta, we value diverse perspectives and experiences. We seek continuous learners and individuals who can enhance our team with their distinct backgrounds.
Join us as we create a world where identity is truly yours.
We are in search of a highly skilled Observability Site Reliability Engineer specializing in Google Cloud, to take charge of and elevate our Observability ecosystem within GCP. In this position, you will progress beyond basic monitoring to develop a world-class, comprehensive, and scalable Observability Platform that supports our SRE teams and business collaborators. You will implement infrastructure as code by employing Terraform and demonstrating strong coding skills in Go, Python, or Ruby to automate the deployment of agents and collectors across intricate distributed systems.
Key Responsibilities
Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools such as Terraform.
GCP Observability Engineering: Enhance the collection, processing, and storage of Observability data to guarantee high reliability and low latency for our Splunk and Grafana services.
Incident Response: Engage in on-call rotations and conduct post-incident reviews to foster systemic improvements and promote 'observability-driven development.'
Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.
Join Unify as a Senior Staff Site Reliability Engineer and take the lead in transforming our technology landscape. In this pivotal role, you will spearhead initiatives to enhance our system reliability and performance, ensuring seamless operations across our platforms. Your expertise will guide a dynamic team, driving innovation and implementing best practices in site reliability engineering.
About the Role
Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.
This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
Full-time|$227.2K/yr - $324.5K/yr|Hybrid|San Francisco, CA (Hybrid)
About the Role: At Tubi, our Site Reliability Engineering (SRE) team transcends traditional operations. We embody a software engineering ethos, leveraging a developer's toolkit to tackle the complexities of large-scale, distributed systems. Our core mission focuses on building resilience from the ground up, empowering our product teams to innovate swiftly while delivering an exceptional user experience. We oversee the availability, latency, performance, and capacity of our platform, driven by a culture of data-informed decision-making, blameless learning, and relentless automation. We are on the lookout for a seasoned and visionary Senior Manager of SRE to lead and expand our newly formed Site Reliability Engineering team. You will be more than just a people manager or tech lead; you will be the strategic architect behind our reliability roadmap. Your role will involve building and mentoring a team of skilled engineers, cultivating an environment of blameless learning and continuous improvement, while advocating for the engineering practices that balance rapid innovation with unwavering stability. You will play a pivotal role within our engineering leadership, collaborating with peers across the organization to embed reliability as a shared responsibility and a fundamental principle of our engineering culture.
About Hive
Hive stands at the forefront of cloud-based AI innovation, providing cutting-edge solutions that enable organizations to understand, search, and generate content. Our platform is relied upon by some of the world's most prestigious and forward-thinking companies. We empower developers with an extensive suite of state-of-the-art, pre-trained AI models that handle billions of API requests each month. In addition to our robust model offerings, we deliver comprehensive software applications backed by proprietary AI models and datasets, unlocking transformative applications in various sectors such as content moderation, brand protection, sponsorship measurement, and context-based advertising.
With over $120 million in funding from esteemed investors like General Catalyst, 8VC, Glynn Capital, Bain & Company, and Visa Ventures, Hive has cultivated a vibrant global team of over 250 employees across our San Francisco, Seattle, and Delhi offices. If you’re passionate about shaping the future of AI, we invite you to join our dynamic team!
DevOps and Systems Team
In response to our distinctive machine learning demands, we have developed our own data centers focusing on distributed high-performance computing with GPU integration. While we harness the power of these data centers, our infrastructure remains hybrid, leveraging public cloud solutions when advantageous. As we scale our machine learning models for commercial use, we are expanding our DevOps and Site Reliability team to ensure the reliability of our enterprise SaaS offerings. Our ideal candidate thrives in dynamic environments, embraces automation, and believes that every task can be automated and every server can scale. You take pride in enhancing performance across all layers of our stack and are committed to never performing the same task manually twice.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Full-time|$127K/yr - $249K/yr|Hybrid|United States
The Team
Join our dynamic Platform Engineering team within Site Reliability Engineering (SRE), which is tasked with maintaining vital infrastructure and operational functions that empower our engineering organization. We manage multi-cloud Kubernetes infrastructures, deployment systems, and observability frameworks.
The Fabric team specializes in ensuring secure communication between systems and the public internet. We focus on network architecture, service mesh, and edge load balancing, safeguarding customer data during transit. Our work is essential in building and sustaining a reliable, globally-connected multi-cloud network for MongoDB products.
This position is available in our New York City headquarters, smaller offices in Austin, Palo Alto, and San Francisco, or as a fully remote role from anywhere in North America. Our hybrid work model accommodates both in-office and remote work.
Full-time|$200.8K/yr - $250.9K/yr|On-site|San Francisco, California
About HeartFlow
HeartFlow, Inc. is a medical technology company focused on improving the diagnosis and management of coronary artery disease. Our flagship product, the AI-powered HeartFlow FFRCT Analysis, provides a non-invasive, color-coded 3D view of a patient’s coronary arteries. Clinicians use our platform to identify blockages, assess blood flow, and analyze atherosclerosis, all in alignment with ACC/AHA Chest Pain Guidelines. HeartFlow’s technology supports care teams in the US, UK, Europe, Japan, and Canada, and has already impacted over 500,000 patients worldwide. As a publicly traded company (NASDAQ: HTFL), HeartFlow continues to expand its product line and modernize its platform to support the next generation of life-saving medical technologies.
Role Overview: Staff/Lead Site Reliability Engineer (SRE)
HeartFlow is searching for an experienced Site Reliability Engineer to join the cloud-native infrastructure team in San Francisco, California. This role works closely with Platform engineers and development teams to maintain and improve the reliability, scalability, observability, and performance of critical systems.
What You Will Do
Collaborate with Platform and development teams to ensure system reliability and performance
Automate complex operational processes and reduce manual work
Establish and promote standards for production excellence
Support ongoing Platform Modernization initiatives
Who We’re Looking For
Extensive experience as a Site Reliability Engineer or in a similar role
Strong background in cloud-native infrastructure
Interest in automation, reliability, and scalable systems
Comfort working with cross-functional engineering teams
Location
This position is based in San Francisco, California.
Full-time|$162K/yr - $249K/yr|On-site|San Francisco, California
Okta is seeking a Staff Site Reliability Engineer to join the Infrastructure Platform AGILE SRE team in San Francisco. This position centers on supporting and improving the systems that underpin Okta’s identity infrastructure.
Role overview
The Staff SRE will work closely with multiple teams to develop and maintain critical infrastructure. A core part of this role involves enhancing internal tools and operational processes, ensuring that Okta’s systems remain secure and reliable as the company grows.
What you will do
Provide cross-functional support to teams building and maintaining key infrastructure components.
Collaborate with Infrastructure Operations groups to address complex technical challenges.
Diagnose, troubleshoot, and resolve sophisticated infrastructure issues by developing new tools and strategic solutions.
Who we’re looking for
Experienced SREs who are comfortable working on large-scale, impactful projects.
Engineers who enjoy collaborating across teams and disciplines.
Problem-solvers who can tackle intricate technical challenges and deliver reliable solutions.
This role offers the chance to contribute directly to Okta’s mission of building secure, trusted infrastructure for organizations navigating the evolving landscape of AI and identity.
Join Our Team
At Cognition, we are at the forefront of applied AI innovation, developing cutting-edge software agents that redefine the engineering landscape. Our flagship products, Devin, the pioneering AI software engineer, and Windsurf, an AI-native IDE, embody our commitment to creating AI that collaborates with engineers as a true partner.
Our team is composed of elite talent including competitive programming champions, visionary founders, and researchers from top AI institutions such as Scale AI, Palantir, Cursor, Google DeepMind, and more.
Your Mission
As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability of our user-focused products, which are utilized by hundreds of thousands of developers daily. Your mission is to preemptively address potential issues and swiftly resolve any incidents that may arise, maintaining a seamless experience for our users.
You will be responsible for overseeing production reliability and enhancing our platform engineering practices, encompassing SLOs, incident response, and on-call duties, alongside CI/CD pipelines, deployment infrastructure, and developer tools. At Cognition, we believe in integrating reliability into our systems rather than treating it as an afterthought, and we strive to cultivate a culture that reflects this philosophy.
Your Achievements
Production Reliability: Establish and manage SLOs, SLIs, and error budgets for our products. Develop robust monitoring, alerting, and observability systems to maintain a transparent view of service health.
Incident Management: Spearhead incident response with precision and promptness. Conduct blameless postmortems to derive actionable insights from outages, and create effective runbooks and tools to enhance on-call sustainability.
Platform Engineering: Oversee deployment pipelines and internal developer tools, ensuring rapid, reliable shipping of code while minimizing unnecessary toil for engineers.
Infrastructure as Code: Manage cloud infrastructure via code, creating reproducible, auditable environments that can scale with product demands and mitigate configuration drift.
Capacity Planning: Analyze growth trends, anticipate resource requirements, and ensure our infrastructure is always ahead of user demand, optimizing system performance proactively.
Security and Reliability: Integrate security protocols with reliability practices to create a robust framework that safeguards our infrastructure.
About Unify
At Unify, we are pioneering the first AI-driven system of action for revenue teams, enabling businesses to transform their outbound strategies into high-performing growth engines. Our focus is on making go-to-market execution measurable, repeatable, and scalable. Founded in 2023 by industry veterans from Ramp and Scale AI, our talented team has diverse experience from leading organizations such as Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive 8x revenue growth and serves notable clients including Perplexity, Cursor, SoFi, and Justworks. We are a dynamic, high-energy team backed by $58M in funding from Thrive, Emergence, OpenAI, and others. Join us as we shape the future of GTM!
About the Role
As the Staff SRE Tech Lead at Unify, you will be instrumental in enhancing the reliability and scalability of our platform as we handle increasing volumes of data and accommodate customers with stringent uptime requirements. You will define the technical roadmap for reliability engineering, lead a dedicated team of SREs, and collaborate closely with engineering leaders to establish systems and practices that ensure Unify remains both swift and dependable at scale.
ABOUT BASETEN
Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.
Full-time|Remote|Global Remote / San Francisco, CA
Join Andromeda as a Senior Site Reliability Engineer specializing in AI Infrastructure. In this pivotal role, you will be responsible for ensuring the reliability, scalability, and performance of our cutting-edge AI systems. Collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our innovative AI initiatives. Your expertise will play a crucial role in maintaining optimal service availability and improving system performance.
About Plaud Inc.
Plaud is revolutionizing the way professionals enhance productivity and performance with our trusted AI work companion. Our innovative note-taking solutions have gained the admiration of over 1,500,000 users globally since our inception in 2023. We are on a mission to amplify human intelligence by developing next-generation intelligence infrastructure and interfaces that seamlessly capture, extract, and leverage what you say, hear, see, and think.
Based in San Francisco, Plaud Inc. is a Delaware-incorporated company that is redefining the boundaries of human-AI collaboration through a unique combination of hardware and software solutions. We adhere to the highest standards of data security and privacy protection, with certifications including ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.
Discover more about our innovative solutions by visiting https://www.plaud.ai and follow us on Instagram, X, Facebook, LinkedIn, and YouTube.
Why You Should Join Us
At Plaud, you will play a pivotal role in shaping the future of human-AI interaction. Here’s what we offer:
A thriving, bootstrapped company with a remarkable $250M revenue run rate achieved in just three years.
An opportunity to define the next-generation paradigm for human-AI interaction.
Direct exposure to cutting-edge AI tools for professionals and a chance to contribute to our global expansion.
Collaborate with a passionate team that values innovation, teamwork, and customer success.
Advance your career in a culture that promotes continuous learning and rapid career growth.
Full-time|On-site|Austin; San Francisco; Seattle; United States
Join MongoDB as a Senior Site Reliability Engineer specializing in Infrastructure Security. In this pivotal role, you'll be at the forefront of ensuring the reliability and security of our cloud infrastructure. Your expertise will help us to design and maintain systems that are robust, efficient, and secure, providing critical support to our engineering teams.
Your responsibilities will include monitoring system performance, implementing security protocols, and troubleshooting incidents to maintain high availability. You will collaborate with cross-functional teams to enhance our security posture, ensuring that our services are resilient and secure.
Mar 26, 2026