
System Engineer, GPU Fleet at Fluidstack | New York, NY

Fluidstack | New York, NY
On-site | Full-time | $200K/yr - $300K/yr

Qualifications

Basic Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience).

  • Proven experience in managing large-scale GPU infrastructure.

  • Strong troubleshooting skills and experience with automation tools.

  • Ability to work collaboratively in a fast-paced environment.

About the job

About Fluidstack

At Fluidstack, we are pioneering the infrastructure that powers advanced artificial intelligence. Collaborating with leading AI laboratories, government entities, and major corporations—including Mistral, Poolside, Black Forest Labs, and Meta—we aim to deliver compute capabilities at unparalleled speeds.

Our mission is to expedite the realization of Artificial General Intelligence (AGI). Our team is driven by a sense of urgency and is dedicated to providing top-tier infrastructure. We view our clients' success as our own and take pride in the robust systems we create and the trust we cultivate. If you are inspired by meaningful work, strive for excellence, and are prepared to exert yourself to advance the future of intelligence, we invite you to join us in shaping what lies ahead.

About the Role

As a System Engineer for our GPU Fleet, you will oversee, operate, and optimize our large-scale GPU compute infrastructure, which is essential for AI/ML training and inference. You will ensure the high availability, performance, and reliability of our GPU server fleet through automation, monitoring, troubleshooting, and collaboration with hardware engineering, platform teams, and data center operations.

Key Responsibilities

  • Maintain and operate a large GPU server fleet (H100, B200, GB200) serving AI/ML workloads; continuously monitor system health, performance, and utilization to ensure maximum uptime and adherence to SLAs.

  • Conduct hands-on troubleshooting and root cause analysis for complex hardware, firmware, operating system, and application issues across GPU clusters; collaborate with vendors and hardware teams to rectify systemic failures.

  • Create and sustain automation scripts for efficient provisioning, configuration management, monitoring, and remediation on a large scale.

  • Enhance tools for GPU health assessments, performance diagnostics, driver validation, and automated recovery processes.

  • Implement server provisioning, configuration, firmware updates, and OS installations utilizing automation frameworks; manage lifecycle operations encompassing deployment, maintenance, and decommissioning.

  • Participate in a 24x7 on-call rotation; respond to production incidents and coordinate resolution efforts with cross-functional teams, including data center operations, network engineering, and application teams.

  • Lead post-incident reviews, document root causes, and spearhead continuous improvement initiatives focused on automation, reliability, monitoring, and operational efficiency.

About Fluidstack

Fluidstack is at the forefront of building the infrastructure necessary for the next generation of intelligence. By partnering with leading AI labs and enterprises, we are dedicated to creating innovative computing solutions that drive the future of artificial intelligence.
