About the job
Join Fluidstack: Pioneering the Future of Intelligence
Fluidstack is at the forefront of building groundbreaking infrastructure that powers abundant intelligence. Collaborating with leading AI labs, government entities, and major enterprises such as Mistral, Poolside, Black Forest Labs, and Meta, we aim to unlock compute capabilities at unprecedented speeds.
Our mission is to transform the vision of AGI into reality, and our team is driven by a sense of urgency and commitment to excellence. We view our customers' success as our own and take immense pride in the robust systems we create and the trust we cultivate. If you are fueled by purpose, dedicated to achieving excellence, and eager to work diligently to propel the future of intelligence, we invite you to be part of our journey in building what’s next.
Role Overview
We are seeking a Product Manager to take charge of our managed services portfolio, which includes SLURM and Kubernetes control planes. In this role, you will shape the product vision and roadmap, guiding enterprises in deploying, managing, and scaling workloads on Fluidstack's cutting-edge infrastructure, from initial cluster provisioning through lifecycle management, observability, and optimization. This position sits at the intersection of infrastructure, developer experience, and operational excellence. You will collaborate closely with engineering, datacenter operations, and customer-focused teams to develop control plane capabilities that can scale to megaclusters of more than 100,000 GPUs.
Key Responsibilities
Lead the product roadmap for managed SLURM and Kubernetes services, encompassing control plane architecture, autoscaling, multi-tenancy, and cluster lifecycle management.
Establish requirements for control plane performance, reliability, and availability, including API rate limits, etcd scaling, provisioning tiers, and failure recovery strategies.
Collaborate with engineering to develop automated provisioning workflows, health monitoring systems, and node lifecycle controllers that minimize downtime and maximize GPU efficiency.
Work alongside datacenter and networking teams to ensure that control plane infrastructure scales effortlessly across geographical regions and supports hybrid deployment models.
Drive strategic decisions on whether to build or integrate with ecosystem tools (such as Rancher, OpenShift, Slurm accounting, and workload orchestrators) based on customer needs and the competitive landscape.
Define metrics and SLAs for control plane uptime, API performance, scheduler throughput, and pod/job launch latency.
Conduct customer discovery sessions to identify pain points related to cluster management, job queueing, resource allocation, and multi-cluster orchestration.

