
Site Reliability Engineer at Blaxel | San Francisco

Blaxel, San Francisco
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

Proficient in cloud infrastructure and automation tools. Experience with monitoring and observability tools. Strong background in incident response and root cause analysis. Ability to design and implement scalable systems. Excellent problem-solving skills and attention to detail.

About the job

Join Our Team as a Site Reliability Engineer

Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.

In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities

Working closely with our founders, infrastructure team, and development team—leveraging AI for maximum efficiency—you will architect and manage the systems that keep Blaxel fast, resilient, and secure.

  • Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.

  • Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.

  • Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.

  • Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.

  • Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.

  • Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.

  • Create automation tools—often utilizing AI agents—to enhance operations, debugging, capacity planning, and failure predictions.

  • Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.

  • Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.

  • Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are

  • Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

About Blaxel

Blaxel is at the forefront of AI technology, dedicated to creating innovative solutions that drive the future of intelligent systems. Our commitment to reliability and performance ensures that our clients receive unparalleled service and support.
