About the job
About Fluidstack
At Fluidstack, we are pioneering the infrastructure that powers advanced artificial intelligence. Collaborating with leading AI laboratories, government entities, and major corporations—including Mistral, Poolside, Black Forest Labs, and Meta—we aim to deliver compute capabilities at unparalleled speeds.
Our mission is to expedite the realization of Artificial General Intelligence (AGI). Our team is driven by a sense of urgency and is dedicated to providing top-tier infrastructure. We view our clients' success as our own and take pride in the robust systems we create and the trust we cultivate. If you are inspired by meaningful work, strive for excellence, and are prepared to work hard to advance the future of intelligence, we invite you to join us in shaping what lies ahead.
About the Role
As a System Engineer for our GPU Fleet, you will oversee, operate, and optimize our large-scale GPU compute infrastructure, which is essential for AI/ML training and inference. You will ensure the high availability, performance, and reliability of our GPU server fleet through automation, monitoring, troubleshooting, and collaboration with hardware engineering, platform teams, and data center operations.
Key Responsibilities
Maintain and operate a large GPU server fleet (H100, B200, GB200) serving AI/ML workloads; continuously monitor system health, performance, and utilization to ensure maximum uptime and adherence to SLAs.
Conduct hands-on troubleshooting and root cause analysis for complex hardware, firmware, operating system, and application issues across GPU clusters; collaborate with vendors and hardware teams to rectify systemic failures.
Create and maintain automation scripts for efficient provisioning, configuration management, monitoring, and remediation at scale.
Enhance tools for GPU health assessments, performance diagnostics, driver validation, and automated recovery processes.
Perform server provisioning, configuration, firmware updates, and OS installations using automation frameworks; manage lifecycle operations encompassing deployment, maintenance, and decommissioning.
Participate in a 24x7 on-call rotation; respond to production incidents and coordinate resolution efforts with cross-functional teams, including data center operations, network engineering, and application teams.
Lead post-incident reviews, document root causes, and spearhead continuous improvement initiatives focused on automation, reliability, monitoring, and operational efficiency.

