Reliability Engineer at Sieve | San Francisco

Sieve | San Francisco

On-site Full-time

Experience Level

Entry Level

Qualifications

Candidates should possess a robust background in reliability engineering, cloud infrastructure, and incident management. Familiarity with video data and a passion for innovative AI solutions will be advantageous.

About the job

About Sieve

Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.

With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.

We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:

System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team

You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.

This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.

If a system failure occurs, you take it personally, thriving under the weight of responsibility.

What You'll Be Doing

Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)

About Sieve

Sieve is at the forefront of AI research, dedicated to harnessing the power of video data to create groundbreaking solutions that enhance the digital landscape. Join us as we redefine video modeling and contribute to transformative applications across various industries.

Similar jobs

Reliability Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

Feb 5, 2026

Software Engineering Intern at Sieve | San Francisco

Sieve

Internship|On-site|San Francisco

Sieve is an AI research lab based in San Francisco, focused on large-scale video data. The team builds and maintains infrastructure capable of handling exabyte-scale video, while developing advanced methods for video understanding. Sieve also curates extensive and diverse datasets to support new approaches in video modeling. With video now accounting for a significant portion of internet traffic, the company aims to remove barriers to high-quality training data for fields like creativity, communication, gaming, AR/VR, and robotics. The company recently completed its Series A and is backed by investors including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant. Sieve has worked with leading AI labs and continues to grow, generating $XXM last quarter with a small team of 15.

Role overview

The Software Engineering Intern at Sieve works across the stack to help build and scale data pipelines that deliver video datasets to clients. This role involves taking full ownership of projects, from sourcing and curating video data to developing machine learning filters, improving system efficiency, and building internal dashboards for quality assurance and delivery. The work directly impacts the timely delivery of high-quality data to customers.

What you will do

Build and scale data pipelines for delivering video datasets
Source and curate video data for a range of applications
Develop and refine machine learning filters to enhance data quality
Improve system efficiency and reliability
Create internal dashboards for quality assurance and delivery tracking

Who thrives here

This internship is well suited to those who enjoy solving complex technical problems, want to collaborate directly with customers at the forefront of Video AI, and seek meaningful challenges in a high-performance setting.

Apr 28, 2026

Product Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

About Us

At Sieve, we are pioneering the future of AI by being the only research lab solely dedicated to video data. Our unique approach integrates exabyte-scale video infrastructure, cutting-edge video comprehension techniques, and a multitude of data sources to create datasets that redefine the boundaries of video modeling. With video constituting 80% of internet traffic, it has become the essential digital medium for creativity, communication, gaming, AR/VR, and robotics. Our mission is to address the critical challenge hindering the growth of these applications: acquiring high-quality training data.

Having partnered with leading AI laboratories, our team of just 15 individuals generated $XX million last quarter, and we successfully secured Series A funding from prestigious firms such as Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

As a Product Engineer at Sieve, you will be at the forefront of developing our video collection platform: a system designed to compensate contributors for submitting video recordings.

You'll engage in full-stack development, covering frontend, backend, systems, and mobile, to swiftly deliver products, enhance reliability, and scale the platform to support millions of users. Additionally, you will create internal tools that enable the team to efficiently source, review, and deliver high-quality data. You will take ownership of projects from start to finish.

This position is perfect for individuals who thrive in a high-ownership environment, enjoy building products with rapid feedback loops, and want to contribute directly to the core engine driving advanced video AI.

What You’ll Do

Develop and scale a production product used by contributors, partners, and internal teams.
Deliver full-stack features encompassing frontend, backend APIs, and supporting systems.
Implement mobile features that facilitate easy video capture and submission.
Enhance reliability, performance, and developer velocity across the platform.
Create internal tools and workflows to ensure operations, QA, and delivery are swift and consistent.
Collaborate closely with customers and operators to translate requirements into product features and iterate on them.

Requirements

Proficient full stack engineer capable of taking projects from concept to production.
Expertise in TypeScript with modern frontend experience (React/Next.js preferred).
Experience with mobile development in React Native, particularly for media capture or upload workflows.
Backend development experience in Python or Go, with a solid understanding of API and database fundamentals.
Comfortable working across both product and systems domains.

Jan 27, 2026

Software Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

About Us

Sieve is a pioneering AI research lab dedicated solely to harnessing the potential of video data. Our innovative approach combines massive exabyte-scale video infrastructure with cutting-edge video understanding techniques and a multitude of data sources, allowing us to create datasets that redefine video modeling. Given that video comprises 80% of internet traffic, it serves as a vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the primary bottleneck hindering the growth of these applications: the need for high-quality training data.

With a small team of just 15, we've collaborated with leading AI labs and generated $XXM in revenue last quarter. Our growth has been supported by our Series A funding from top-tier firms such as Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

As a Software Engineer at Sieve, you will play a pivotal role in developing and scaling the data pipelines that produce the datasets we provide to our customers. You will take full ownership of projects from inception to deployment: managing data sourcing and curation, developing machine learning filters, enhancing system efficiency, and creating internal dashboards for quality assurance and delivery. Your contributions will be essential in ensuring that our customers consistently receive timely, high-quality data.

This position is perfect for individuals who excel at tackling challenging problems, enjoy direct customer engagement, and aspire to push the boundaries of Video AI technology.

Sep 2, 2025

Applied Research Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

Join Our Pioneering Team

At Sieve, we are trailblazers in the realm of AI research, specifically dedicated to harnessing the power of video data. Our cutting-edge infrastructure processes exabyte-scale video, utilizing innovative video understanding methodologies, and integrating diverse data sources to create groundbreaking datasets that redefine video modeling. With video accounting for a staggering 80% of global internet traffic, it stands as the cornerstone of digital creativity, communication, gaming, AR/VR, and robotics. Our mission is to eliminate the primary barrier to the growth of these technologies: the scarcity of high-quality training data.

Having collaborated with leading AI laboratories, we achieved $XXM in revenue last quarter alone with a compact team of just 15 talented individuals. Our successful Series A funding round last year, backed by prestigious firms such as Matrix Partners, Swift Ventures, Y Combinator, and AI Grant, underscores our potential for exponential growth.

The Role You’ll Play

As an Applied Research Engineer at Sieve, you will be instrumental in constructing high-performance building blocks and expansive pipelines to achieve high-precision video comprehension at internet scale. Your role will often involve tackling ambiguous research challenges and devising ingenious solutions. You will engage with domains including computer vision, audio processing, and text processing.

The ideal candidate will possess a strong command of models and APIs, leveraging innovative pre/post-processing techniques, parallelism, pipelining, inference optimization, and occasional fine-tuning to maximize performance.

Apr 26, 2025

Applied Research Engineering Intern

Sieve

Full-time|On-site|San Francisco

Sieve is a 15-person AI research lab in San Francisco focused on video data. The team builds exabyte-scale video infrastructure and develops new approaches for video understanding, drawing from diverse data sources to create advanced datasets. With video now accounting for most internet traffic, Sieve aims to solve the challenge of delivering high-quality training data for applications in creativity, communication, gaming, AR/VR, and robotics. The company partners with leading AI labs and has achieved strong financial results, backed by Series A funding from Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

Internship overview

The Applied Research Engineering Intern will help build high-performance components and large-scale pipelines to advance video understanding at internet scale. This role involves tackling ambiguous research problems and turning them into practical solutions. Projects often cover computer vision, audio processing, and text processing.

What you will do

Develop and optimize models and APIs for video, audio, and text data
Improve performance through pre- and post-processing, parallelism, pipelining, and inference optimization
Occasionally fine-tune models for specific tasks
Work through open-ended research challenges with a small, focused team

Who succeeds here

Comfortable working with machine learning models and APIs
Skilled at optimizing systems for speed and accuracy
Enjoys solving ambiguous technical problems across computer vision, audio, and text domains

Apr 28, 2026

Senior Site Reliability Engineer at Drata | San Francisco

Drata

Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.

Values

Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.

How the Team Works

Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.

Growth and Culture

Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.

See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026

Site Reliability Engineer at Mercor | San Francisco

Mercor

Full-time|On-site|San Francisco

Join the Mercor Team

At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.

Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students: by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.

We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.

Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.

About the Role

As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.

Your Responsibilities

Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.

Who We Seek

The ideal candidate will have:

Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.

Dec 27, 2025

Site Reliability Engineer at Superhuman | San Francisco

Superhuman, Inc.

Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco

At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.

About Superhuman

Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.

The Opportunity

To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.

At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.

As an SRE, your responsibilities will include:

Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.

Jun 18, 2025

Site Reliability Engineer at Superhuman | San Francisco

Superhuman

Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA

At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.

About Superhuman

Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.

The Opportunity

In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model.

Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.

As a Site Reliability Engineer, your responsibilities will include:

Scaling our Kubernetes-based control plane, processing billions of events daily.
Enhancing our automation mechanisms in response to workload demands.
Deploying machine learning systems across the organization.

Mar 18, 2026

Senior Site Reliability Engineer at Hyperbolic | San Francisco

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Who We Are

At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.

In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.

About the Role

We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.

In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026

Founding Platform & Reliability Engineer at OpenArt | San Francisco

OpenArt

Full-time|On-site|San Francisco Bay Area

Founding Platform & Reliability Engineer

About OpenArt

OpenArt is a revolutionary AI-driven storytelling and visual creation platform utilized by millions around the globe. Our mission is to build the next generation of creative tools powered by advanced AI technology, allowing users to generate videos, visuals, characters, and narratives with speed and creativity never seen before. We envision a future where creativity is inherently AI-native, and we are at the forefront of this transformation.

Why Join OpenArt?

Be part of a small, dynamic team where senior engineers are responsible for significant systems, not just fragments.
Contribute to large-scale projects, with your work impacting millions of users swiftly.
Benefit from a founder-led engineering culture where both founders are technical and actively engaged in product and architectural decisions.
Work on an AI-native product, crafting how state-of-the-art AI models translate into tangible user experiences.
Experience high ownership with minimal bureaucracy, emphasizing judgment, clarity, and speed.
Join us during a period of significant growth, with a 7-10X revenue increase over the past two years, and play a pivotal role in scaling the company to new heights.

About the Role

We are seeking a Founding Platform & Reliability Engineer to take charge of the design, scalability, and reliability of our entire infrastructure stack, from high-level architectural choices to hands-on implementation, observability, and cost management.

This role is not suited for traditional operators or narrow DevOps specialists. You should be adept at navigating cloud infrastructure, distributed systems, backend services, and developer tools, making practical decisions that optimize product velocity, system reliability, and cost efficiency, particularly in a fast-paced AI-centric landscape.

You will collaborate closely with the founders and product engineers to design and refine the platform that powers OpenArt, influencing key decisions like serverless versus containerized architecture, multi-provider AI reliability, and scaling systems for millions of users, while serving as a force multiplier for the entire engineering team.

What You’ll Do

Establish and operationalize SLOs/SLIs across essential user journeys (generation, editing, payments/credits, uploads, etc.), utilizing them to guide prioritization (including error budgets).
Lead the design and implementation of robust infrastructure solutions that effectively support OpenArt's rapid growth and evolving needs.

Mar 26, 2026

Senior Site Reliability Engineer at prosper | San Francisco

prosper

Full-time|On-site|San Francisco, CA

Role overview

The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.

What you will do

Design and set up monitoring tools to track the health and performance of systems
Automate routine operational tasks to minimize manual intervention and boost efficiency
Diagnose and resolve complex technical problems that impact infrastructure or services
Support projects aimed at strengthening infrastructure stability and preparing for future growth

Location

This role is located in San Francisco, CA.

Apr 27, 2026

Software Engineer, Infrastructure Reliability at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available (including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure), you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.

About the Role

As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely utilized AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to effectively serve millions of users.

This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.

Key Responsibilities

Design, construct, and maintain reliable, high-performance systems utilized across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.

Ideal Candidate Profile

Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.

Mar 19, 2026

Site Reliability Engineer at Blaxel | San Francisco

Blaxel

Full-time|On-site|San Francisco

Join Our Team as a Site Reliability Engineer

Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.

In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities

Working closely with our founders, infrastructure team, and development team (leveraging AI for maximum efficiency), you will architect and manage the systems that keep Blaxel fast, resilient, and secure.

Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are

Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

Mar 3, 2026

Site Reliability Engineer at EngFlow | San Francisco

EngFlow

Full-time|On-site|San Francisco

Join Our Team at EngFlow

EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.

Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.

Discover more about our mission, culture, and team: EngFlow | Watch Our Video

We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.

Jan 27, 2026

Senior Hardware Reliability Engineer at Samsara | San Francisco, CA

Samsara

Full-time|$204K/yr - $240K/yr|Hybrid|San Francisco, CA, United States

Who We Are

Samsara (NYSE: IOT) is a trailblazer in the Connected Operations™ Cloud, a platform that empowers organizations reliant on physical operations to leverage Internet of Things (IoT) data for actionable insights and operational improvements. Our mission at Samsara is to enhance the safety, efficiency, and sustainability of the physical operations that underpin the global economy. Covering over 40% of global GDP, these sectors include agriculture, construction, field services, transportation, and manufacturing. We are dedicated to digitally transforming their operations on a large scale.

Joining Samsara means you'll be part of a team that's defining the future of physical operations. You will contribute to a dynamic range of product solutions, including Video-Based Safety, Vehicle Telematics, Apps and Driver Workflows, and Equipment Monitoring. As a company that has recently gone public, you will enjoy the autonomy and support to make a significant impact as we build for the future.

About the Role:

Samsara's Hardware Reliability team plays a crucial role in ensuring an outstanding customer experience through reliable hardware. As a Senior Hardware Reliability Engineer, you will establish quality processes that uphold the high standards of Samsara's hardware.

In this role, you will implement and execute comprehensive reliability strategies that cover the entire product development lifecycle, from concept to warranty repair. You will rapidly gather and analyze test, field performance, and manufacturing data to drive necessary actions both internally and with our suppliers, ensuring the production of top-quality products. Collaboration with hardware, firmware, and operations teams is a fundamental aspect of this role.

This is a hybrid position open to candidates residing in the US, requiring you to visit our office in San Francisco three times a week.

You Should Apply If:

You want to impact the industries that run our world: Your efforts will lead to tangible real-world benefits, helping to maintain essential services and support vital industries.

Feb 14, 2026

Site Reliability Engineer at Latent | San Francisco

Latent

Full-time|On-site|San Francisco

Site Reliability Engineer

Location: San Francisco, CA (5 Days In-Office)

As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.

What Makes a Great Engineer at Latent

We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.

Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.

Your Responsibilities

As our SRE, you will take full ownership of the production environment and enhance the developer experience:

Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.

Dec 5, 2025

Senior Platform & Reliability Engineer at Vizcom | San Francisco

Vizcom

Full-time|$200K/yr - $250K/yr|On-site|San Francisco

Agency Notice: We are not currently collaborating with recruiting agencies for this role. We kindly ask that you refrain from contacting Vizcom employees regarding this position. Any resumes submitted without prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a cutting-edge visual creation platform that merges advanced web tooling with AI-driven workflows. Our technology stack incorporates React/TypeScript for the front end, Node/Koa + PostGraphile for API services, PostgreSQL, Redis, BullMQ for queuing, and a Kubernetes-based production infrastructure.

We are seeking a seasoned expert to oversee platform stability and infrastructure, ensuring our system remains reliable, efficient, and resilient as we scale.

Role Mission

Take full ownership of service reliability: proactively prevent incidents, minimize impact during failures, and guide swift, high-quality recovery during production downtimes.

This role involves hands-on technical leadership, granting you the authority to establish reliability standards and enforce production protocols.

Compensation

Base salary between $200,000 and $250,000, plus significant equity.

Your Responsibilities

Reliability Standards: Define and uphold SLIs/SLOs/error budgets for key user interactions.
Resilience of Production Architecture: Implement failure isolation across APIs, workers, queues, and interdependencies to ensure one subsystem's failure does not disrupt core access.
Kubernetes Runtime Reliability: Establish probe contracts, deployment standards, graceful shutdown protocols, scaling/resource policies, and startup safety measures.
Queue & Job Safety (BullMQ/Redis): Manage poison pill containment and workload segregation.
Incident Command Quality: Lead Sev1/Sev2 incident responses from containment to corrective actions.
Reliability Operating System: Oversee observability quality (prioritizing signal over noise), on-call efficiency, runbook maintenance, and postmortem discipline.
Deployment Safety Authority: Gate risky deployments and enforce reliability protocols whenever production health is compromised.

Feb 24, 2026

Senior Site Reliability Engineer at Carta | San Francisco, CA

Carta

Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA

Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.

Apr 3, 2026

1 - 20 of 11,381 Jobs

Reliability Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

Feb 5, 2026

Software Engineering Intern at Sieve | San Francisco

Sieve