Senior Staff Cloud Support Engineer at Crusoe | San Francisco, CA

Crusoe | San Francisco, CA - US
On-site Full-time $180K/yr - $220K/yr

Experience Level

Senior

Qualifications

  • Extensive experience with Linux systems, Kubernetes, and networking protocols.

  • Proven track record in designing and implementing high-reliability cloud infrastructures.

  • Strong analytical skills and experience in incident management and root cause analysis.

  • Familiarity with AI/ML infrastructure and performance optimization techniques.

  • Excellent communication skills and the ability to work collaboratively across teams.

About the job

At Crusoe, we are on a mission to revolutionize the future by accelerating the abundance of energy and intelligence. We are building the foundational engine that empowers individuals to create bold innovations with AI while ensuring sustainability, speed, and scalability.

Join us at the forefront of the AI revolution with cutting-edge sustainable technology. You will play a pivotal role in driving meaningful innovation, making a significant impact, and collaborating with a team that is leading the way in responsible, transformative cloud infrastructure.

About the Role
As a Senior Staff Cloud Support Engineer, you will serve as a technical expert within Crusoe Cloud and significantly enhance the efforts of our Customer Experience, SRE, Networking, Fleet, and Product teams. Your role transcends basic ticket resolution; you will design reliability frameworks, influence architectural decisions, mentor senior engineers, and safeguard revenue by averting large-scale incidents. With profound expertise in Linux systems, Kubernetes, networking, and AI/ML infrastructure, you will apply your knowledge with a strong focus on customer satisfaction. You will be comfortable navigating uncertainty, leading incident responses, and shaping the global scaling of high-performance AI infrastructure.

Key Responsibilities

  • Act as the top escalation point for complex P1/P0 incidents.

  • Lead cross-functional investigations into root causes involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers.

  • Collaborate with SRE and Software teams (Storage, Networking, Compute, Kubernetes) to devise systemic solutions rather than temporary fixes.

Reliability Architecture

  • Design and enhance node validation, burn-in processes, performance baselining, and release readiness.

  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability.

  • Minimize MTTR and prevent incident recurrence through structural enhancements.

AI/ML Infrastructure Expertise

  • Troubleshoot NCCL, IB, GPU driver/firmware issues, and distributed training failures.

  • Support complex AI workloads (training + inference) through performance tuning and observability enhancements.

Customer-Facing Authority

  • Act as a senior technical advisor during high-stakes customer incidents.

About Crusoe

Crusoe is at the forefront of energy and intelligence innovation, dedicated to crafting sustainable technologies that empower ambitious AI-driven creativity. By joining Crusoe, you will be part of a team that is redefining the landscape of cloud infrastructure with a commitment to responsibility and transformation.
