About the job
At Crusoe, we are on a mission to revolutionize the future by accelerating the abundance of energy and intelligence. We are building the foundational engine that empowers individuals to create bold innovations with AI while ensuring sustainability, speed, and scalability.
Join us at the forefront of the AI revolution with cutting-edge sustainable technology. You will play a pivotal role in driving meaningful innovation, making a significant impact, and collaborating with a team that is leading the way in responsible, transformative cloud infrastructure.
About the Role
As a Senior Staff Cloud Support Engineer, you will serve as a technical expert within Crusoe Cloud and strengthen the work of our Customer Experience, SRE, Networking, Fleet, and Product teams. Your role goes beyond basic ticket resolution: you will design reliability frameworks, influence architectural decisions, mentor senior engineers, and safeguard revenue by averting large-scale incidents. You will apply deep expertise in Linux systems, Kubernetes, networking, and AI/ML infrastructure with a strong focus on customer satisfaction. You will be comfortable navigating uncertainty, leading incident response, and shaping how high-performance AI infrastructure scales globally.
Key Responsibilities
Act as the top escalation point for complex P1/P0 incidents.
Lead cross-functional investigations into root causes involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers.
Collaborate with SRE and Software teams (Storage, Networking, Compute, Kubernetes) to devise systemic solutions rather than temporary fixes.
Reliability Architecture
Design and enhance node validation, burn-in processes, performance baselining, and release readiness.
Influence Kubernetes architecture, workload orchestration and provisioning (Slurm, Terraform), and AI/ML cluster stability.
Minimize MTTR and prevent incident recurrence through structural enhancements.
AI/ML Infrastructure Expertise
Troubleshoot NCCL, IB, GPU driver/firmware issues, and distributed training failures.
Support complex AI workloads (training + inference) through performance tuning and observability enhancements.
Customer-Facing Authority
Act as a senior technical advisor during high-stakes customer incidents.

