About the job
At Crusoe, we're on a mission to transform the landscape of energy and intelligence. Our goal is to create an ecosystem where individuals can harness the full potential of AI, all while prioritizing sustainability and scalability.
Join us in pioneering the AI revolution with innovative, sustainable technology. Your contributions will drive significant advancements and shape the future of responsible cloud infrastructure.
About the Role
As an Incident Manager, you will play a pivotal role in ensuring service reliability and maintaining customer confidence. Your efforts will directly influence our success by minimizing downtime and efficiently addressing critical incidents. You will oversee high-visibility incidents and customer escalations, guaranteeing quick and effective responses to intricate technical challenges.
Beyond immediate incident resolution, we aim to refine our incident management strategies to improve the customer experience during crises and to implement robust preventive measures afterward. Using data analytics, you will strengthen resiliency and reliability, so that every incident becomes an opportunity to improve both our products and our processes.
What You’ll Be Working On
Crisis Management & Data-Driven Resiliency
Lead incident responses for high-impact situations, ensuring minimal disruption to customer operations. You will be the steady force during crises, managing communications and strategies to uphold customer trust during outages or critical failures.
Leverage data analytics to identify incident trends, converting insights into actionable strategies that enhance system resiliency and reliability.
Formulate comprehensive incident response strategies. Emphasize prevention by conducting thorough post-incident reviews to address root causes and eliminate recurrences.
Technical Execution & Customer Support
Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training.
Assist customers in implementing and optimizing their HPC infrastructure for maximum performance and efficiency.
Create and present training materials, including internal sessions, documentation, and knowledge base articles, to empower customers.

