About the job
At Crusoe, our mission is to drive the evolution of energy and intelligence. We are developing the technology that fuels a future where individuals can ambitiously harness AI capabilities without compromising on scale, speed, or sustainability.
Join us in revolutionizing AI with sustainable solutions at Crusoe. In this role, you will be at the forefront of innovation, making a significant impact while collaborating with a team that is shaping the future of responsible and transformative cloud infrastructure.
About This Role:
We are looking for a dedicated Hardware Production/Sustaining Engineer to enhance Crusoe's Hardware Systems Engineering team. This position is critical for bridging essential skill gaps in debugging, validation, and production support for high-performance computing systems. You will manage the entire hardware lifecycle—from prototype initiation to large-scale production—focusing on automation, deep troubleshooting, and reliability within Crusoe Cloud’s GPU- and CPU-oriented infrastructure.
Your collaboration with cross-functional teams will be vital in supporting, debugging, and enhancing hardware platforms on a large scale, specifically targeting PCIe, InfiniBand, and NVMe/storage, which have been highlighted as key areas for expanded expertise. Your contributions will directly influence Crusoe’s capability to deploy and maintain sustainable, AI-driven computing systems that deliver exceptional performance and reliability.
Your Responsibilities Will Include:
Leading the complete hardware development and sustaining lifecycle, encompassing feasibility studies, bring-up, validation, deployment, and ongoing production support.
Creating and sustaining automation frameworks and scripts for hardware testing, diagnostics, and continual reliability enhancements.
Executing in-depth troubleshooting and debugging across:
PCIe (including link training, topology, and performance issues)
InfiniBand (focusing on fabric debugging, throughput, and connectivity challenges)
NVMe/storage (addressing performance bottlenecks, firmware interactions, and failure analyses)
Performing extensive system validation and characterization for GPU, CPU, and high-performance computing platforms.
Assisting in end-to-end integration and solution testing to guarantee that Crusoe Cloud products fulfill performance, reliability, and scalability standards.
Collaborating with teams across mechanical, thermal, firmware, software, and manufacturing domains to troubleshoot and enhance system performance.

