About Fluidstack
At Fluidstack, we are pioneering the infrastructure for advanced intelligence. Collaborating with leading AI laboratories, governmental bodies, and enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock computational power at unprecedented speeds.
Our mission is urgent: to turn Artificial General Intelligence (AGI) into a tangible reality. Our team is driven, dedicated to delivering top-tier infrastructure, and we treat the outcomes of our customers as if they were our own, taking immense pride in the systems we develop and the trust we establish. If you are purpose-driven, passionate about excellence, and ready to work diligently to propel the future of intelligence, we invite you to join us in shaping what comes next.
About the Role
As a Senior / Staff Site Reliability Engineer (SRE) at Fluidstack, you will be central to our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.
You will collaborate closely with teams in networking, platform engineering, and data center operations to construct systems that can scale to meet the increasing demands of AI workloads.
SREs at Fluidstack are hands-on experts with deep systems knowledge and excellent communication skills. Your responsibilities will include resolving complex production issues, deploying robust infrastructure, and continuously improving the stability and observability of our platform as we expand.
A typical day might involve:
- Deploying clusters of over 1,000 GPUs using custom playbooks and adjusting these tools to deliver optimal solutions for our clients.
- Validating the correctness and performance of our compute, storage, and networking infrastructure, while collaborating with providers to enhance these subsystems.
- Migrating petabytes of data from public cloud platforms to local storage, efficiently and cost-effectively.
- Troubleshooting issues across the stack, from clearing obstructed server fans to optimizing S3 data loaders across regions.
- Building internal tools that reduce deployment times and improve cluster reliability, automating wherever the customer benefit clearly outweighs the implementation cost.
This role will require participation in an on-call rotation of up to one week per month.

