Software Engineer Reliability jobs in San Francisco – Browse 5,516 openings on RoboApply Jobs

Software Engineer Reliability jobs in San Francisco

Open roles matching “Software Engineer Reliability” with location signals for San Francisco. 5,516 active listings on RoboApply Jobs.

1 - 20 of 5,516 Jobs
OpenAI
Full-time|On-site|San Francisco

Become a vital part of the engineering teams that responsibly bring OpenAI’s transformative technologies to the world!

At OpenAI, our Applied Engineering team collaborates across research, engineering, product management, and design to deliver AI solutions to both consumers and businesses. We are committed to learning from our deployments, maximizing the benefits of AI, and ensuring that this powerful technology is utilized both safely and ethically. Our priority is safety over unchecked growth.

About the Role

As OpenAI continues to expand, we are seeking seasoned engineers who excel in problem-solving to enhance the scalability of our systems. Our achievements hinge on our ability to rapidly iterate on product development while ensuring optimal performance and reliability. You will thrive in a collaborative, fast-paced environment, playing a key role in delivering our technology to millions globally, with a focus on safety and reliability. As a reliability engineer, you will lead efforts to maintain and improve the stability, scalability, and performance of our dynamic infrastructure. You will collaborate closely with cross-functional teams, including software engineers, product managers, and data scientists, to construct and sustain robust systems capable of accommodating our growing user base and workload.

Your Responsibilities Include:

Designing and implementing solutions to scale our infrastructure to meet increasing demands effectively.
Developing and maintaining load, chaos, and synthetic testing software that enhances the reliability of systems designed by development teams.
Creating and managing automation tools to streamline repetitive tasks and bolster system reliability.
Overseeing the lifecycle management platform for CPU/storage, GPU, and network resources to foster efficiency and support dynamic optimization.
Implementing fault-tolerant and resilient design patterns to minimize service interruptions.
Establishing and maintaining service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability.
Collaborating with researchers, engineers, product managers, and designers to introduce new features and research advancements to the world.
Participating in an on-call rotation to address critical incidents and ensure 24/7 system availability.

Your Impact: Your contributions will be essential in guaranteeing the reliability and performance of our platforms as we continue to scale our operations.

Oct 17, 2025
Checkr, Inc.
Full-time|Remote|Denver, Colorado, United States; San Francisco, California, United States

Join Checkr as a Software Engineer focusing on Reliability, where your contributions will enhance our platform's robustness and performance. You will be part of a dynamic team dedicated to building and scaling systems that support our growth and ensure outstanding service delivery to our clients.

Mar 13, 2026
Sierra
Full-time|On-site|San Francisco, CA

About Us

At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.

Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.

Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.

Your Role

In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.

Lead the development of Sierra’s observability stack (monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.

Qualifications

Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.

Oct 21, 2025
Fastly, Inc.
Full-time|$181.2K/yr - $217.5K/yr|On-site|Denver, CO; San Francisco, CA

At Fastly, we empower individuals to connect more effectively with the things they cherish. Our cutting-edge edge cloud platform enables customers to swiftly, securely, and reliably craft exceptional digital experiences by processing, serving, and safeguarding their applications as close to their end-users as possible, right at the edge of the Internet. Tailored for modern internet demands, our platform is programmable and supports agile software development. We proudly serve many of the world's leading companies, including GitHub, Yelp, Paramount, and JetBlue.

Join us in our mission to build a more trustworthy Internet.

Posting Open Date: Feb. 25, 2026
Anticipated Posting Close Date*: March 25, 2026
*Please note that this job posting may close early depending on the volume of applications.

Role Overview:

The Data Reliability team is seeking an experienced Senior Software Engineer to contribute to the development and support of next-generation data storage solutions at Fastly. The ideal candidate will possess expertise in backend and data services within cloud environments, proficiency with configuration and orchestration tools such as Terraform and Kubernetes, and the ability to create internal administration tools using Go and related technologies. Our team plays a vital role in ensuring the infrastructure, orchestration, and reliability of Fastly's most data-intensive applications, utilizing technologies like Terraform, Elasticsearch, ClickHouse, Prometheus, MySQL, and Redis across both cloud and hardware platforms. Your contributions will directly enhance our customers' success by providing product teams with a robust platform for efficient and consistent delivery of high-quality, high-throughput, globally distributed data systems and products. We embrace a distributed work model and value both collaborative and asynchronous communication styles.

Key Responsibilities:

Deploy, support, and maintain various critical data storage systems, scaling from gigabytes to petabytes.
Develop statistics and dashboards to track service-level objectives for these systems.
Create and manage tools for configuration, backup, and authenticated access to data systems employing peer review, CI/CD, and both daemon- and container-based deployment strategies.
Write high-performance, maintainable, and concise code, actively participating in code reviews to enhance the codebase.

Mar 20, 2026
Sigma Computing
Full-time|$170K/yr - $240K/yr|On-site|San Francisco, CA

About the Role

Sigma Computing is growing its engineering team in San Francisco, CA. The company builds technology to help users access data with ease. As a Senior Software Engineer focused on Observability and Reliability, you will work alongside engineers who value high standards and collaboration.

What You Will Do

Design and build observability platforms and tools, including metrics collection, logging, distributed tracing, dashboards, alerting, and application performance monitoring.
Work with technologies such as Go, OpenTelemetry, and Kubernetes to solve reliability challenges.
Take part in on-call rotations to help maintain strong uptime for Sigma’s services.
Create tools and processes to improve cloud incident triage and reduce downtime.
Define and promote practices that make systems and services measurable and observable.
Join design and code reviews with peers and stakeholders to reinforce quality and effective collaboration.

Apr 25, 2026
OpenAI
Full-time|On-site|San Francisco

About Our Team

Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available (including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure), you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.

About the Role

As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely utilized AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to effectively serve millions of users.

This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.

Key Responsibilities

Design, construct, and maintain reliable, high-performance systems utilized across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.

Ideal Candidate Profile

Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.

Mar 19, 2026
unify
Full-time|On-site|San Francisco Office

Join unify as a Staff Backend Engineer specializing in Reliability. In this pivotal role, you will be responsible for designing, developing, and maintaining backend systems that ensure the reliability and performance of our services. Collaborate with cross-functional teams to implement robust solutions and drive continuous improvement initiatives.

Mar 24, 2026
Harvey
Full-Time|On-site|San Francisco

Why Join Harvey?

At Harvey, we are revolutionizing the landscape of legal and professional services with a holistic approach. By integrating advanced AI technology, a robust enterprise platform, and extensive industry knowledge, we are redefining how essential knowledge work is conducted for years to come.

This is a unique opportunity to contribute to the foundation of a transformative company at a pivotal moment in its journey. With over 1,000 clients across more than 58 countries, a solid product-market fit, and outstanding investor backing, we are rapidly expanding and creating a new category in real-time. The challenges are significant, expectations are high, and the potential for personal, professional, and financial development is unparalleled.

Our team comprises driven, intelligent individuals who are deeply passionate about our mission. We prioritize speed, intensity, and accountability in addressing challenges, from initial ideation to long-term solutions. By maintaining close relationships with our clients, from executives to engineers, we collaboratively address pressing issues with urgency and care. If you excel in uncertain environments, strive for excellence, and wish to shape the future of work alongside a team that raises the bar, we invite you to build alongside us.

At Harvey, we are currently writing the future of professional services, and we are just getting started.

Your Role

As a Senior Software Engineer on the Site Reliability team at Harvey, your mission will be to uphold the reliability, scalability, and performance of our innovative legal AI platform. You will become part of a high-impact team that operates at the crossroads of infrastructure and product, taking ownership of the systems that ensure our platform remains fast, secure, and continuously available. From scaling operations across 50+ regions to automating critical processes, your efforts will fortify Harvey's resilience as we expand. If you are enthusiastic about constructing robust systems and simplifying complexity through automation, we would love to collaborate with you.

This position is situated in San Francisco, CA, and we adhere to an in-person work model, providing relocation assistance to new employees.

Your Responsibilities

Design, implement, and oversee monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions.
Lead incident management processes, including postmortems, root cause analyses, and driving actionable enhancements.
Automate operational tasks and workflows by developing tools and processes for capacity planning, seamless rollouts, and secure data access to maintain high reliability and minimize manual intervention.
Collaborate across teams to drive solutions that enhance system performance and reliability.

Dec 1, 2025
Abridge
Full-time|On-site|SF Office

About Abridge

Founded in 2018, Abridge is dedicated to enhancing understanding in the healthcare sector. Our innovative AI-powered platform is specifically designed to enhance medical conversations, streamlining clinical documentation while allowing healthcare providers to prioritize what matters most: their patients.

Our enterprise-grade technology revolutionizes patient-clinician dialogues by converting them into structured clinical notes in real-time, with integrated EMR functionalities. Utilizing Linked Evidence and our auditable AI, we uniquely map AI-generated summaries to verified ground truth, fostering quick trust among providers. As trailblazers in generative AI for healthcare, we are establishing industry benchmarks for the responsible integration of AI within health systems.

Our diverse team comprises practicing MDs, AI scientists, PhDs, creatives, technologists, and engineers, all united in empowering individuals and simplifying care. Our offices are situated in San Francisco's Mission District, New York's SoHo neighborhood, and East Liberty in Pittsburgh.

The Role

As part of our rapidly scaling services and engineering team, we are seeking seasoned Site Reliability Engineers (SREs) to enhance our software's performance, stability, and scalability significantly. This role focuses primarily on distributed systems, with approximately 80% dedicated to software and 20% to cloud infrastructure.

You will play a pivotal role in integrating load testing and chaos engineering into our CI pipelines. You will utilize observability and profiling tools to pinpoint and rectify performance bottlenecks, collaborate with various teams to transition their applications to more scalable infrastructures, and ensure a seamless experience as we expand our application adoption in the healthcare domain. This may include embedding with other teams for extended periods.

The platform we are developing must optimize both engineering speed and security, facing significant scale challenges and presenting numerous opportunities to exercise creativity, independence, and leadership in taking projects from inception to fruition. This is a rare chance to advance your career in a rapidly growing company that harnesses cutting-edge technologies.

What You'll Do

Utilize load testing, chaos engineering, and other testing methodologies to uncover performance and latency issues across all systems, implementing code changes to resolve them.
Lead software modifications that facilitate the migration of applications at the code level to new infrastructures (including run times, event-driven frameworks, databases, etc.).

May 21, 2025
Veeam Software
Full-time|On-site|San Francisco Bay, CA, USA

Join Veeam Software as a Site Reliability Engineer III, where you'll be at the forefront of ensuring the reliability, scalability, and performance of our software solutions. You will leverage your expertise in system administration and programming to improve our infrastructure and automate processes, making Veeam a leader in cloud data management.

Mar 22, 2026
Harvey
Full-Time|On-site|San Francisco

Why Join Harvey?

At Harvey, we're not just changing the landscape of legal and professional services; we're revolutionizing it from the ground up. By integrating cutting-edge AI technology with an enterprise-level platform and profound domain knowledge, we're setting new standards for how knowledge work is conducted for generations to come.

This is a unique opportunity to be a part of a transformative journey at a pivotal moment for our company. With over 1,000 clients in more than 58 countries, a robust product-market fit, and exceptional investor backing, we are rapidly scaling and defining a new industry standard. The challenges are ambitious, the expectations are high, and the potential for personal, professional, and financial growth is unparalleled.

Our team is composed of sharp, driven individuals who are deeply aligned with our mission. We operate with agility and intensity, taking ownership of the challenges we face, from initial brainstorming to long-term solutions. We engage closely with our clients, from executive leaders to engineers, collaborating to swiftly address real-world challenges with urgency and diligence. If you excel in uncertain environments, strive for excellence, and want to shape the future of work alongside high achievers, we encourage you to join our mission.

At Harvey, we are writing the future of professional services today, and we’re just getting started.

Role Overview

As a Staff Software Engineer on our Site Reliability Engineering (SRE) team, you will play a crucial role in ensuring the reliability, scalability, and performance of our legal AI platform. You'll be part of a dynamic team that bridges infrastructure and product, taking ownership of systems that guarantee our platform is fast, secure, and consistently operational. Your efforts will be pivotal in scaling our operations across over 50 regions and in automating essential operational tasks. If you are enthusiastic about creating resilient systems and simplifying processes through automation, we would love to have you on board.

This position is based in San Francisco, CA, and we follow an in-person work model, offering relocation assistance to new hires.

Dec 1, 2025
Gridware
Full-time|On-site|San Francisco, CA

About Gridware

Gridware is an innovative technology firm headquartered in San Francisco, committed to safeguarding and enhancing the reliability of the electrical grid. We have pioneered a revolutionary approach to grid management known as Active Grid Response (AGR), which meticulously monitors the electrical, physical, and environmental factors influencing grid safety and reliability. Our state-of-the-art AGR platform leverages high-precision sensors to identify potential issues at an early stage, facilitating proactive maintenance and fault resolution. This holistic strategy is designed to bolster safety, minimize outages, and ensure optimal grid performance. We are proud to be supported by prominent climate-tech and Silicon Valley investors. To learn more, visit www.Gridware.io.

About the Role

We are seeking a skilled Senior Hardware Reliability Engineer to lead reliability testing, analysis, and lifetime modeling of various outdoor electronic assemblies. This pivotal role will concentrate on the electronic components of our products, collaborating closely with our mechanical-focused Reliability Engineer and engaging with the broader hardware and cross-functional teams.

Feb 21, 2026
Multiply Labs
Full-time|On-site|San Francisco

About Multiply Labs

Multiply Labs is an innovative startup located in San Francisco, California, backed by renowned investors in technology and life sciences such as Casdin Capital, Lux Capital, and Y Combinator. Our goal is to develop the world's leading robotic systems and utilize them to make groundbreaking life-saving therapies accessible to everyone.

We are transforming the manufacturing process of cell therapies through the creation of advanced robotic systems that automate and scale the production of these crucial treatments. Our cutting-edge robots enable biopharma companies to produce cell therapies efficiently without overhauling their existing processes, thus minimizing regulatory hurdles and risks. Unlike traditional methods that are labor-intensive and costly (often exceeding $1M per patient), our robotic solutions aim to make these vital treatments more affordable and reachable for those who need them.

To discover more and view our robots in action, please visit www.multiplylabs.com and follow us on LinkedIn.

Position Overview

We are looking for a dedicated Hardware Reliability Engineer to become an essential part of Multiply Labs’ Reliability Engineering team. As a founding member, you will collaborate closely with the Hardware Product and Systems Integration teams to enhance our designs throughout the entire development lifecycle, from initial prototypes to fully deployed GMP production systems. Your contributions will directly support the delivery of life-saving therapies by ensuring our robots operate seamlessly within the high-stakes biotech environment.

Jan 28, 2026
OpenAI
Full-time|On-site|San Francisco

Join Our Innovative Team

At OpenAI, our Hardware organization is pioneering cutting-edge silicon and system-level solutions tailored to meet the demands of advanced AI workloads. We pride ourselves on developing next-generation AI-native silicon while collaborating with software and research partners to create hardware that is intricately integrated with AI models. Our mission includes delivering high-performance silicon for OpenAI’s supercomputing infrastructure and designing custom tools and methodologies that accelerate innovations, specifically optimized for AI applications.

Your Role in Our Mission

We are on the lookout for a dynamic and experienced Reliability/DFX Engineer who possesses extensive knowledge in scaling machine learning systems. As an integral member of our hardware team, you will collaborate with chip design, platform design, hardware health, and the wider industry ecosystem to architect, implement, and deploy dependable next-generation AI accelerator systems. You will take a holistic approach to evaluate system and chip architecture, pinpointing high-ROI opportunities that enhance reliability and availability throughout the stack while translating these insights into actionable strategies and silicon features.

Key Responsibilities:

Lead the architecture, implementation, and execution of DFX strategies in silicon from concept to high-volume deployment, proposing impactful features to boost reliability and fault tolerance. Your focus will encompass design for testability, reliability, availability, and serviceability of high-performance AI hardware.
Develop system-level reliability models based on empirical data to guide the organization’s DFX and reliability strategy, necessitating a deep understanding of chip and system architecture, design, implementation, and component-level reliability.
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and integration of digital/mixed-signal IP, firmware/system software, and DFX methodologies.
Work alongside hardware health and platform design teams to enhance reliability and fault tolerance in New Product Introduction (NPI) and High-Volume Manufacturing (HVM), driving continuous, data-driven improvements across the stack through optimized operating conditions and data analysis.
Act as the DFX/reliability advocate, aligning the broader industry ecosystem with OpenAI’s strategic objectives and roadmap.

Qualifications:

Bachelor’s degree in Engineering or related field with 15+ years of experience, or a Master’s degree with 10+ years of relevant experience.
Proven expertise in DFX methodologies and reliability engineering for high-performance hardware.
Strong analytical and problem-solving skills, with a track record of improving system reliability and performance.
Excellent collaboration and communication abilities, capable of working effectively in a cross-functional team environment.
Familiarity with AI workloads and associated hardware requirements is highly desirable.

Sep 17, 2025
Cloudflare, Inc.
Full-time|Hybrid|Hybrid

Join Cloudflare as a Database Reliability Engineer, where you will play a crucial role in ensuring the reliability and performance of our database systems. You will work collaboratively with our engineering teams to develop, implement, and maintain robust database solutions that support our mission of making the internet safer and faster.

Your responsibilities will include monitoring database performance, troubleshooting issues, and optimizing queries to enhance system efficiency. If you are passionate about databases and eager to make an impact in a dynamic environment, we encourage you to apply!

Feb 6, 2026
Senior Reliability Test Engineer

Astranis Space Technologies Corp.

Full-time|$130K/yr - $180K/yr|On-site|San Francisco

Astranis is at the forefront of satellite technology, crafting advanced satellites designed for high orbits to broaden humanity's exploration of the solar system. Our satellites deliver dedicated, secure networks to a diverse range of esteemed clients worldwide, including large enterprises, government entities, and the US military. With five satellites currently operational and several more set to launch, we are addressing a robust backlog of over $1 billion in commercial contracts.

We take pride in being the leading choice for satellite communications among clients with demanding standards for uptime, data security, network visibility, and customization. Having secured over $750 million from top-tier investors such as Andreessen Horowitz, Blackrock, and Fidelity, our team of 450 engineers and entrepreneurs operates from our expansive 153,000 sq. ft. headquarters in Northern California, USA.

Senior Reliability Test Engineer

As a Senior Reliability Test Engineer, you will play a pivotal role in collaborating across all engineering disciplines to ensure our hardware achieves exceptional quality and reliability standards. With Astranis ramping up satellite production, your expertise will be essential in establishing a comprehensive reliability test program that supports the development of new product designs, monitors manufacturing processes, and identifies long-term reliability issues. The ideal candidate will possess extensive engineering experience with high-reliability products, demonstrate autonomy, and have the capability to design a reliability test program from the ground up.

Mar 9, 2026
Astranis
Full-time|$135K/yr - $235K/yr|On-site|San Francisco

Astranis is revolutionizing satellite technology by creating advanced spacecraft designed for high orbits, thereby extending humanity's presence in the solar system. Our satellites deliver dedicated and secure networks to an elite clientele, including large corporations, government entities, and the U.S. military. With five satellites successfully launched and a robust pipeline of over $1 billion in commercial contracts, Astranis is set for growth as we prepare for numerous upcoming launches.

We are the go-to satellite communications partner for clients demanding exceptional uptime, data security, network visibility, and tailored solutions. Backed by over $750 million from industry-leading investors such as Andreessen Horowitz, Blackrock, and Fidelity, our team of 450 engineers and entrepreneurs thrives in our 153,000 sq. ft. headquarters in Northern California.

Senior Electrical Reliability Engineer

As a Senior Reliability Engineer at Astranis, you will be pivotal in ensuring that our spacecraft electronics and systems fulfill our reliability and availability requirements. Collaborating with a multidisciplinary engineering team, you will push the boundaries of geo-synchronous spacecraft design and achieve previously unattainable performance in space. Your expertise will ensure that Design for Reliability remains central to our engineering efforts.

Mar 18, 2026
Drata
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.

Values

Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.

How the Team Works

Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.

Growth and Culture

Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.

See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026
Redwood Materials
Full-time|On-site|San Francisco, California, United States

We are seeking a talented and motivated Reliability Engineer to join our innovative team at Redwood Materials. In this role, you will be responsible for ensuring the reliability and performance of our cutting-edge energy storage systems. You will collaborate with cross-functional teams to develop and implement reliability engineering strategies that enhance product performance and longevity.

Mar 25, 2026
Cognition
Full-time|On-site|San Francisco Bay Area

Join Our Team

At Cognition, we are at the forefront of applied AI innovation, developing cutting-edge software agents that redefine the engineering landscape. Our flagship products, Devin, the pioneering AI software engineer, and Windsurf, an AI-native IDE, embody our commitment to creating AI that collaborates with engineers as a true partner.

Our team is composed of elite talent including competitive programming champions, visionary founders, and researchers from top AI institutions such as Scale AI, Palantir, Cursor, Google DeepMind, and more.

Your Mission

As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability of our user-focused products, which are utilized by hundreds of thousands of developers daily. Your mission is to preemptively address potential issues and swiftly resolve any incidents that may arise, maintaining a seamless experience for our users.

You will be responsible for overseeing production reliability and enhancing our platform engineering practices, encompassing SLOs, incident response, and on-call duties, alongside CI/CD pipelines, deployment infrastructure, and developer tools. At Cognition, we believe in integrating reliability into our systems rather than treating it as an afterthought, and we strive to cultivate a culture that reflects this philosophy.

Your Achievements

Production Reliability: Establish and manage SLOs, SLIs, and error budgets for our products. Develop robust monitoring, alerting, and observability systems to maintain a transparent view of service health.
Incident Management: Spearhead incident response with precision and promptness. Conduct blameless postmortems to derive actionable insights from outages, and create effective runbooks and tools to enhance on-call sustainability.
Platform Engineering: Oversee deployment pipelines and internal developer tools, ensuring rapid, reliable shipping of code while minimizing unnecessary toil for engineers.
Infrastructure as Code: Manage cloud infrastructure via code, creating reproducible, auditable environments that can scale with product demands and mitigate configuration drift.
Capacity Planning: Analyze growth trends, anticipate resource requirements, and ensure our infrastructure is always ahead of user demand, optimizing system performance proactively.
Security and Reliability: Integrate security protocols with reliability practices to create a robust framework that safeguards our infrastructure.

Oct 13, 2025
