Senior / Staff Site Reliability Engineer at Fluidstack | San Francisco, CA

Fluidstack, San Francisco, CA
On-site, Full-time, $175K/yr - $320K/yr

Experience Level

Senior

Qualifications

The ideal candidate will possess:

  • A strong customer-centric approach, accountability, and a proactive mindset.
  • A proven track record of delivering clean, well-documented code in complex environments.

About the job

About Fluidstack

At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock computational power at unprecedented speeds.

Our mission is urgent: to turn Artificial General Intelligence (AGI) into a tangible reality. Our team is driven, dedicated to delivering top-tier infrastructure, and we treat the outcomes of our customers as if they were our own, taking immense pride in the systems we develop and the trust we establish. If you are purpose-driven, passionate about excellence, and ready to work diligently to propel the future of intelligence, we invite you to join us in shaping what comes next.

About the Role

As a Senior / Staff Site Reliability Engineer (SRE) at Fluidstack, you will be central to our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.

You will collaborate closely with teams in networking, platform engineering, and data center operations to construct systems that can scale to meet the increasing demands of AI workloads.

SREs at Fluidstack are hands-on experts with deep systems knowledge and excellent communication skills. Your responsibilities will include resolving complex production issues, deploying robust infrastructure, and continuously improving the stability and observability of our platform as we expand.

A typical day might involve:

  • Deploying clusters of over 1,000 GPUs using custom playbooks and adjusting these tools to deliver optimal solutions for our clients.
  • Validating the correctness and performance of our compute, storage, and networking infrastructure, while collaborating with providers to enhance these subsystems.
  • Migrating petabytes of data from public cloud platforms to local storage, efficiently and cost-effectively.
  • Troubleshooting issues across the stack, from hardware problems such as obstructed server fans to software work such as optimizing S3 data loaders across different regions.
  • Creating internal tools to reduce deployment times and enhance cluster reliability, including automation where customer benefits clearly surpass implementation costs.

This role will require participation in an on-call rotation of up to one week per month.

About Fluidstack

Fluidstack is at the forefront of creating infrastructure for advanced intelligence, partnering with prestigious AI labs and enterprises to revolutionize computing capabilities.
