Site Reliability Engineer at Blaxel | San Francisco

Blaxel | San Francisco

On-site Full-time

Experience Level

Mid to Senior

Qualifications

Proficient in cloud infrastructure and automation tools. Experience with monitoring and observability tools. Strong background in incident response and root cause analysis. Ability to design and implement scalable systems. Excellent problem-solving skills and attention to detail.

About the job

Join Our Team as a Site Reliability Engineer

Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.

In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities

Working closely with our founders, infrastructure team, and development team—leveraging AI for maximum efficiency—you will architect and manage the systems that keep Blaxel fast, resilient, and secure.

Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools—often utilizing AI agents—to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are

Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

About Blaxel

Blaxel is at the forefront of AI technology, dedicated to creating innovative solutions that drive the future of intelligent systems. Our commitment to reliability and performance ensures that our clients receive unparalleled service and support.

Similar jobs

AI Reliability & Monitoring Engineering Lead at Postman | San Francisco

Postman

Full-time|$256K/yr - $276K/yr|On-site|San Francisco, California, United States

Who Are We?

Postman stands at the forefront of the API revolution, serving over 45 million developers and 500,000 organizations, including 98% of the Fortune 500. We empower developers and professionals worldwide to construct an API-first ecosystem by simplifying every aspect of the API lifecycle while enhancing collaboration and innovation.

Headquartered in San Francisco, our offices span Boston, New York, Austin, Tokyo, London, and Bangalore (our roots). As a privately held enterprise, we have attracted investments from leading firms including Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. To dive deeper into our vision, explore The 'API-First World' graphic novel.

The Opportunity

We are on the lookout for a skilled AI Systems Reliability Engineer who will play a pivotal role in defining, building, and maintaining the infrastructure and processes that guarantee the reliability, scalability, and performance of our AI-enhanced API and agentic systems in production. This position emphasizes monitoring, availability, incident response, and automation to support AI services and tools relied on by millions globally.

What You’ll Do

Develop and manage reliability metrics (SLOs) for AI-driven API services and features of the agentic AI platform.
Implement comprehensive observability and monitoring systems for real-time performance and fault detection.

Mar 19, 2026

Enterprise Solutions Engineer at Postman | San Francisco

Postman

Full-time|$200K/yr - $245K/yr|On-site|San Francisco, California, United States

Who Are We?

Postman is the premier API platform, empowering over 45 million developers and 500,000 organizations, including 98% of the Fortune 500. We are driving the shift towards an API-first world by simplifying every phase of the API lifecycle and enhancing collaboration among teams, enabling users to create exceptional APIs with greater efficiency.

Headquartered in San Francisco, Postman operates offices in Boston, New York, Austin, Tokyo, London, and Bangalore, where our journey began. We are privately funded by esteemed investors such as Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. Discover more at postman.com or follow us on X at @getpostman.

P.S: We highly encourage you to explore The "API-First World" graphic novel to gain insight into our vision at Postman.

The Opportunity

With the growing number of organizations adopting Postman, we are seeking a talented Enterprise Solutions Engineer to join our team and contribute to the expansion of our enterprise business. You will collaborate closely with our sales team to foster an API-first development culture, build lasting customer relationships, and guide Postman users in maximizing our platform for effective API development. We are looking for someone passionate about APIs, experienced in enterprise sales, proficient in JavaScript, and a skilled Postman user!

You will benefit from the support of our sales, customer success, product, and engineering teams, all committed to your success in this role.

What You’ll Do

Drive enterprise sales by nurturing prospects and supporting strategic and enterprise customers.
Conduct discovery sessions, qualification processes, technical demonstrations, and proof of value workshops for potential customers eager to adopt Postman for their API lifecycle.
Address technical inquiries or objections regarding Postman, providing solutions or workarounds to meet customer requirements.
Gain a deep understanding of our customers' current workflows and guide them in implementing API-first development best practices.
Share customer insights with relevant teams and serve as a general advocate for customer needs.
Stay informed about the competitive landscape, trends, and challenges in the API market.
Create proof of concept integrations, tools, and workflows as necessary to support prospective customers.

Mar 17, 2026

Staff Engineer - Desktop Platform at Postman | San Francisco

Postman

Full-time|On-site|San Francisco, California, United States

Who Are We?

Postman is recognized as the leading API platform globally, empowering over 45 million developers and 500,000 organizations, including 98% of the Fortune 500. We are dedicated to transforming the API landscape by simplifying every phase of the API lifecycle, fostering collaboration, and enabling users to create superior APIs with increased efficiency.

Headquartered in San Francisco, we also have offices in Boston, New York, Austin, Tokyo, London, and Bangalore, our founding city. As a privately held company, we have received support from notable investors such as Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. To learn more about our vision, visit postman.com or connect with us on X at @getpostman.

P.S: We highly recommend reading The 'API-First World' graphic novel to gain insights into our vision at Postman.

The Opportunity

As a Staff Engineer for the Desktop Platform, you will lead the technical direction of our Electron-based desktop application, which is utilized by over 40 million developers worldwide. Your expertise will influence its architecture, performance, and security for the foreseeable future.

This role is primarily focused on individual contributions. You will engage in both design and implementation while serving as the technical cornerstone for our Desktop Platform. Your responsibilities will include defining architecture, developing essential features, and mentoring fellow engineers, ultimately guiding Postman’s evolution across various desktop operating systems.

About the Team

The Client Platform organization is tasked with providing a streamlined pathway for developing, testing, and releasing Postman’s web and desktop applications. We manage the frameworks, SDKs, build tools, shared libraries, and infrastructure that our product engineers depend on daily. Our platform serves as the foundation for hundreds of Postman engineers to deliver new features and enhancements to over 40 million users.

Within this group, the Desktop Platform team focuses on Postman’s integration with Electron and OS-level infrastructure. Our team ensures that Postman is easily installable, consistently updatable, and performs optimally across Windows, macOS, and Linux. We collaborate closely with the Client SDK and Build/Release teams to guarantee a seamless user experience.

What You’ll Do

Lead the architecture of Postman’s Electron-based desktop application.
Design and implement core desktop functionalities, including installers, auto-updates, and OS integrations.
Enhance performance across desktop platforms, optimizing startup times, memory, and CPU utilization.

Feb 4, 2026

Deal Operations Analyst at Postman | San Francisco

Postman

Full-time|$85K/yr - $140K/yr|On-site|San Francisco, California, United States

About Postman

Postman is a leading API platform, trusted by over 45 million developers and 500,000 organizations worldwide, including 98% of the Fortune 500. The company’s mission is to simplify the API lifecycle and help teams collaborate, so users can build better APIs, faster. Headquartered in San Francisco, Postman also has offices in Boston, New York, Austin, Tokyo, London, and Bangalore. Postman is privately held, backed by investors such as Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. Learn more at postman.com or follow @getpostman on X. For a deeper look at Postman’s vision, check out The "API-First World" graphic novel.

Role Overview: Deal Operations Analyst (San Francisco, In-Office)

The Deal Operations Analyst joins the Deal Operations team, reporting to the Head of Deal Operations (who reports to Revenue Operations). This role works closely with sales, finance, legal, renewals, and orders teams throughout the deal cycle. The analyst helps ensure every deal follows Postman’s policies and procedures, from quote creation to booking. This is a full-time, in-office position based in San Francisco.

Key Responsibilities

Create quotes and order forms for sales transactions.
Review and approve qualified quote requests.
Oversee the booking process for customer orders, maintaining quality and efficiency at each stage.
Support both internal and external clients with order booking and tracking questions.
Handle escalations related to exception management for booking requests.
Investigate discrepancies in Annual Recurring Revenue (ARR) calculations.
Assist Deal Operations and Sales with unique, custom, or complex deal structures to ensure quotes and order forms accurately reflect each agreement.
Coordinate with internal teams, including finance, legal, renewals, and collections, to facilitate the order process, resolve issues, and achieve shared goals.

Apr 16, 2026

Account Development Representative at Postman | San Francisco

Postman

Full-time|On-site|San Francisco, California, United States

Role overview

The Account Development Representative at Postman plays a key part in expanding the reach of Postman’s API development platform. The focus is on connecting with potential clients and helping to drive platform adoption within the developer and business community.

What you will do

Identify and contact prospective customers interested in API solutions
Engage with leads and foster strong client relationships
Collaborate with sales and marketing teams to uncover new business opportunities
Contribute to initiatives that encourage wider use of Postman’s platform by developers and organizations

Location

This role is based in San Francisco, California.

Apr 27, 2026

SEO Manager at Postman | New York, San Francisco

Postman

Full-time|Remote|New York, New York, United States; San Francisco, California, United States

Join Postman as an SEO Manager and lead our efforts in enhancing search engine visibility and driving organic traffic. You will be responsible for developing and implementing SEO strategies that align with our business goals. Collaborate with cross-functional teams to optimize existing content and create new strategies that elevate our brand in the digital marketplace.

Apr 10, 2026

Lead Developer for AI Agent Technology

Postman

Full-time|$256K/yr - $276K/yr|On-site|San Francisco, California, United States

Who Are We?

Postman is the premier API platform, empowering over 45 million developers and 500,000 organizations globally, including 98% of the Fortune 500. Our mission is to enable the creation of an API-first world by simplifying every phase of the API lifecycle and enhancing collaboration, allowing users to build superior APIs more efficiently.

Headquartered in San Francisco, Postman also has offices in Boston, New York, Austin, Tokyo, London, and Bangalore, where we originated. As a privately held company, we are backed by prominent investors such as Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. Discover more at postman.com or engage with us on X via @getpostman.

P.S: We encourage you to read The "API-First World" graphic novel to gain insight into our vision at Postman.

The Opportunity

As the AI Agent Development Lead, you will spearhead the design, development, and deployment of cutting-edge AI agents that engage with users and complex environments. You will play a crucial role in shaping the architecture and execution of scalable, dependable AI systems, collaborating closely with research, product, and engineering teams to create safe, interpretable, and efficient AI technologies.

What You’ll Do

Lead a diverse engineering team focused on AI agent development from initial design through to production deployment.
Design and implement AI agent architectures utilizing cutting-edge language models and related technologies.
Work alongside research scientists to conduct scalable experiments and translate research innovations into product features.
Drive the evolution of agent capabilities, including dialogue management, decision-making, and autonomy.
Ensure that AI safety and alignment principles are embedded throughout the agent lifecycle.
Mentor and develop technical team members, fostering a culture of collaboration and innovation.
Assess new tools, frameworks, and methodologies to enhance AI agent development.

Mar 19, 2026

Reliability Engineer at Sieve | San Francisco

Sieve

Full-time|On-site|San Francisco

About Sieve

Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.

With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role

As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.

We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:

System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team

You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.

This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.

If a system failure occurs, you take it personally, thriving under the weight of responsibility.

What You'll Be Doing

Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)

Feb 5, 2026

Site Reliability Engineer at Mercor | San Francisco

Mercor

Full-time|On-site|San Francisco

Join the Mercor Team

At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.

Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students: by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.

We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.

Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.

About the Role

As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.

Your Responsibilities

Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.

Who We Seek

The ideal candidate will have:

Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.

Dec 27, 2025

AI Engineer - San Francisco

Aegis AI

Full-time|On-site|San Francisco

Join a pioneering team of former Google engineers who have developed ground-breaking defensive technologies, such as Safe Browsing and reCAPTCHA. We are on a mission to confront an urgent challenge: combating the rising tide of adversarial AI attacks that threaten organizations globally.

Operating in stealth mode, we are targeting a lucrative $5B+ market that is primed for innovation. Conventional detection methodologies are proving inadequate against the speed and sophistication of AI-driven assaults. Current adversaries are leveraging AI to engineer tailored, high-evasion attacks, leaving traditional systems vulnerable.

Your Role:

You will design a network of AI agents that are rapid, cost-effective, and precise, collaborating to identify and neutralize emerging threats. Your work will dive deep into real-time threat data, continuously evolving your agents in a fast-paced environment. These agents will function under an orchestration layer that fosters quick adaptation and learning.

The Excitement of the Challenge

Rapidly Evolving Models: The landscape changes daily; solutions that worked yesterday may be outdated today.
Intelligent Adversaries: We are engaged in a real-time arms race against cunning, AI-enhanced attackers crafting sophisticated payloads.
No Existing Playbook: We are forging new detection paradigms as swiftly as threats evolve. This high-stakes work places you in the heart of the action from day one.

If you thrive on solving challenging problems with rapid feedback, this is your opportunity.

Why We Are Positioned to Succeed

Expansive Market: The market is vast at $5B and expanding quickly, while established players struggle to adapt.
Proven Track Record: Our team has previously developed the foundational technology for Safe Browsing (serving over 5B users) and reCAPTCHA (protecting more than 5M websites) during our time at Google.
Experienced Team: This is our third endeavor in creating a category-defining security enterprise, and we know how to scale our technology and our organization effectively.
Deeply Integrated AI and Security: We embed AI from the outset rather than layering it on top.
Top Talent: We hire only the highest achievers; many on our team were in the top 1% of engineers at Google. If you excelled in your previous role, you will fit right in.
Agility: We prioritize speed and efficiency in everything we do.

Sep 10, 2025

Director of AI Platform Engineering

Postman

Full-time|$310K/yr - $400K/yr|On-site|San Francisco, California, United States

Who Are We?

Postman stands as the premier API platform globally, empowering over 45 million developers and 500,000 organizations, including 98% of the Fortune 500. Our mission is to facilitate a seamless API-first world by simplifying every stage of the API lifecycle and enhancing collaboration, enabling users to build superior APIs more efficiently.

Headquartered in San Francisco, Postman has a global presence with offices in Boston, New York, Austin, Tokyo, London, and Bangalore, where we were established. As a privately funded company, we are backed by esteemed investors including Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. Discover more at postman.com or engage with us on X via @getpostman.

P.S: We encourage you to explore The "API-First World" graphic novel for a deeper insight into our vision at Postman.

The Opportunity

In the role of Head of AI Platform Engineering at Postman, you will spearhead the integration of AI development within our expanding API platform. You will be responsible for steering the AI roadmap with an emphasis on enhancing AI-driven API collaboration and intelligent capabilities throughout the platform. This position seeks a visionary leader adept at identifying market trends, orchestrating cross-functional AI initiatives, and cultivating robust partnerships to elevate the impact of the Postman AI platform.

What You’ll Do

Direct the formulation and execution of Postman’s AI platform strategy, concentrating on the growth of the API ecosystem and innovation within the platform.
Lead the AI roadmap, focusing on API integration, platform expansion, and AI-enabled agent functionalities.
Identify and leverage market opportunities for AI-enhanced API collaboration and intelligent agent features.
Work collaboratively with business units, product teams, engineering, and external partners to ensure successful deployment of AI initiatives.
Oversee implementation with major AI platforms (OpenAI, Anthropic, AWS, etc.), ensuring both technical and strategic alignment with AI features and improvements to the API lifecycle.
Inspire and mentor internal teams and external collaborators to embrace innovation in AI within API contexts.
Continuously assess emerging AI technologies and tools to enhance platform capabilities.

About You

Demonstrated leadership experience managing AI or platform teams, ideally within API-centric environments.

Mar 2, 2026

Software Engineer, Infrastructure Reliability at OpenAI | San Francisco

OpenAI

Full-time|On-site|San Francisco

About Our Team

Join our dynamic Infrastructure organization at OpenAI, where we are actively seeking talented software engineers to bolster our efforts across several high-impact teams. With a variety of focus areas available (including Core Distributed Systems, Databases, Observability, and Cloud Infrastructure), you'll have the opportunity to work on projects that fascinate you. Our teams operate with a high level of autonomy and foster a deeply collaborative environment, all dedicated to enhancing safety, reliability, and operational velocity across the organization.

About the Role

As a Software Engineer focused on Infrastructure Reliability, you will play a pivotal role in scaling and fortifying the infrastructure that supports some of the world’s most widely utilized AI systems. Your work will ensure that our systems maintain high reliability, observability, performance, and security, enabling researchers to iterate rapidly and allowing products like ChatGPT and the OpenAI API to effectively serve millions of users.

This hands-on, impactful role is perfect for engineers who enjoy ownership, excel at solving complex technical challenges across the entire stack, and wish to contribute to systems that facilitate cutting-edge research deployed on a global scale. You will significantly influence technical direction, enhance system resilience, and collaborate closely with infrastructure, product, and research teams to transform intricate infrastructure into dependable platforms.

Key Responsibilities

Design, construct, and maintain reliable, high-performance systems utilized across engineering.
Identify and resolve performance bottlenecks and inefficiencies, ensuring our infrastructure scales appropriately.
Investigate and troubleshoot complex issues thoroughly.
Enhance automation to minimize manual tasks and improve internal developer tools.
Participate in incident response, postmortem analysis, and the development of best practices surrounding system reliability and scalability.

Ideal Candidate Profile

Possess a deep understanding of distributed systems principles, with a proven track record in developing and managing scalable, reliable systems.
Demonstrate a strong focus on performance and optimization, with the ability to maximize efficiency in complex, globally distributed systems.
Have experience managing orchestration systems such as Kubernetes at scale and creating abstractions over cloud platforms.
Be comfortable working within Linux environments and possess strong problem-solving skills.

Mar 19, 2026

Founding Platform & Reliability Engineer at OpenArt | San Francisco

OpenArt

Full-time|On-site|San Francisco Bay Area

Founding Platform & Reliability Engineer

About OpenArt

OpenArt is a revolutionary AI-driven storytelling and visual creation platform utilized by millions around the globe. Our mission is to build the next generation of creative tools powered by advanced AI technology, allowing users to generate videos, visuals, characters, and narratives with speed and creativity never seen before. We envision a future where creativity is inherently AI-native, and we are at the forefront of this transformation.

Why Join OpenArt?

Be part of a small, dynamic team where senior engineers are responsible for significant systems, not just fragments.
Contribute to large-scale projects, with your work impacting millions of users swiftly.
Benefit from a founder-led engineering culture where both founders are technical and actively engaged in product and architectural decisions.
Work on an AI-native product, crafting how state-of-the-art AI models translate into tangible user experiences.
Experience high ownership with minimal bureaucracy, emphasizing judgment, clarity, and speed.
Join us during a period of significant growth, with a 7-10X revenue increase over the past two years, and play a pivotal role in scaling the company to new heights.

About the Role

We are seeking a Founding Platform & Reliability Engineer to take charge of the design, scalability, and reliability of our entire infrastructure stack, from high-level architectural choices to hands-on implementation, observability, and cost management.

This role is not suited for traditional operators or narrow DevOps specialists. You should be adept at navigating cloud infrastructure, distributed systems, backend services, and developer tools, making practical decisions that optimize product velocity, system reliability, and cost efficiency, particularly in a fast-paced AI-centric landscape.

You will collaborate closely with the founders and product engineers to design and refine the platform that powers OpenArt, influencing key decisions like serverless versus containerized architecture, multi-provider AI reliability, and scaling systems for millions of users, while serving as a force multiplier for the entire engineering team.

What You’ll Do

Establish and operationalize SLOs/SLIs across essential user journeys (generation, editing, payments/credits, uploads, etc.), utilizing them to guide prioritization (including error budgets).
Lead the design and implementation of robust infrastructure solutions that effectively support OpenArt's rapid growth and evolving needs.

Mar 26, 2026

Site Reliability Engineer at Blaxel | San Francisco

Blaxel

Full-time|On-site|San Francisco

Mar 3, 2026


Lead AI Engineer at Hilbert | San Francisco

Hilbert

Full-time|On-site|San Francisco

Join Hilbert, a pioneering growth engine that leverages data science to provide B2C teams with predictive insights into user behavior, key revenue drivers, and strategies for sustainable growth. Designed for agility, Hilbert condenses lengthy decision-making processes into mere minutes.

Trusted by Fortune 10 enterprises and beloved brands such as FreshDirect, Blank Street, and Levain Bakery, our platform is the backbone of their growth strategies. We are also collaborating with leading AI companies to co-create innovative solutions.

We are in search of a Lead AI Engineer who is ready to take charge of the technical direction of Hilbert's AI infrastructure, engage in hands-on development of production-grade systems, and enhance our expanding engineering team with a founder's mindset of ownership and urgency.

This position is not about managing from afar; you will be actively coding, making architectural decisions, establishing high standards for quality and speed, and cultivating our engineering culture as we grow. You will be the go-to person for navigating ambiguous challenges, high-stakes situations, and uncharted paths. If you possess deep technical expertise coupled with strong leadership and communication skills, we would love to connect with you.

THE ROLE

You will collaborate closely with the founding team and various departments, including product, data, and go-to-market teams, to lead the design, development, and continuous improvement of the AI systems that power Hilbert. You will be deeply involved in coding daily while also defining how we approach development, prioritize tasks, and grow the engineering team. Our environment is characterized by high autonomy and ambiguity, which is inherent in building AI-driven products. Requirements may shift, methodologies can evolve, and the individual closest to the issue will often make the pivotal decisions. As the Lead, it is your responsibility to ensure the team is empowered to make those informed decisions effectively.

Build: Hands-on, every day

Design, build, and maintain AI-powered features and pipelines tailored for enterprise customers at scale.
Architect and implement agent-based workflows utilizing frameworks such as LangChain or LangGraph.
Take ownership of critical systems from experimentation to production deployment and monitoring.
Develop and refine evaluation pipelines to assess, validate, and iterate on AI system performance.
Make pragmatic engineering decisions in the face of uncertainty; ship, learn, and iterate.

Lead: Set direction and elevate standards

Define and own the technical roadmap for the AI stack in collaboration with the founding team.
Make architectural and infrastructural decisions that drive our AI initiatives forward.

Feb 26, 2026


Senior Site Reliability Engineer at Hyperbolic | San Francisco

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Who We Are

At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.

In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.

About the Role

We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.

In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026


Site Reliability Engineer at Latent | San Francisco

Latent

Full-time|On-site|San Francisco

Site Reliability Engineer

Location: San Francisco, CA (5 Days In-Office)

As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.

What Makes a Great Engineer at Latent

We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.

Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.

Your Responsibilities

As our SRE, you will take full ownership of the production environment and enhance the developer experience:

Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.

Dec 5, 2025


Support Engineering Lead at Retell AI | San Francisco Bay Area

Retell AI

Full-time|$170K/yr - $230K/yr|On-site|San Francisco Bay Area

About Retell AI

Retell AI is pioneering the future of customer service by leveraging first principles to transform the call center experience through advanced voice AI technology.

In just 18 months since our inception, our innovative AI voice agents have been utilized by thousands of companies to efficiently manage sales, support, and logistics calls, significantly reducing the need for large teams of human agents. With the backing of esteemed investors such as Y Combinator and Alt Capital, we have achieved an impressive $36 million ARR, growing from $5 million at the beginning of 2025 with a dedicated team of 20 professionals.

Our ambitious vision for 2026 is to develop a state-of-the-art CX platform powered entirely by AI. We aim to create intelligent AI 'workers' that will function as frontline agents, QA analysts, and managers, capable of autonomously executing, monitoring, and enhancing customer interactions.

We are rapidly expanding and seeking passionate innovators eager to solve challenging technical problems, work swiftly, and make a tangible impact at one of the fastest-growing voice AI startups. Join us in shaping the future.

Recognized as a top 50 AI app in a16z's list: https://tinyurl.com/5853dt2x
Ranked #4 on Brex's Fast-Growing Software Vendors of 2025: https://www.brex.com/journal/brex-benchmark-december-2025
Listed among the top-ranking startups at: https://leanaileaderboard.com/

About the Role

As the Support Engineering Lead at Retell, you will be responsible for overseeing the technical support operations of our sophisticated voice AI platform. This hands-on leadership position involves directly troubleshooting complex customer challenges, developing AI agents and automations, and leading an expanding team of support engineers.

You will work at the intersection of engineering, customer relations, and product development, ensuring unparalleled reliability, swift issue resolution, and scalable support systems as Retell continues its growth trajectory.

This position is perfect for individuals who thrive on resolving intricate technical issues under pressure, engaging with clients directly, and constructing systems and teams from the ground up.

Jan 14, 2026


GTM Recruiting Lead at Retell AI | San Francisco

Retell AI

Full-time|$180K/yr - $240K/yr|On-site|San Francisco Bay Area

About Retell AI

Retell AI builds voice AI technology for call centers, helping thousands of companies handle sales, support, and logistics calls with AI voice agents. Since launching 18 months ago, the company has grown annual recurring revenue from $5M to $50M, backed by investors including Y Combinator and Alt Capital.

Our Mission

By 2026, Retell AI aims to deliver a customer experience platform powered entirely by AI. The vision goes beyond basic automation: intelligent agents will take on roles as frontline workers, quality assurance analysts, and managers, improving customer interactions while reducing the need for ongoing human involvement.

Recognition

Ranked among the top 50 AI applications by a16z (a16z List)
#4 on Brex's Fast-Growing Software Vendors of 2025 (Brex Benchmark)
High placement on the Lean AI Leaderboard

Working at Retell AI

The team is expanding quickly and welcomes those who want to solve tough problems and help shape the future of voice AI. Retell AI is based in the San Francisco Bay Area and continues to grow as one of the industry's fastest-moving startups.

Apr 14, 2026


Senior Site Reliability Engineer at Drata | San Francisco

Drata

Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.

Values

Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.

How the Team Works

Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center.

The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.

Growth and Culture

Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.

See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026
