About the job
At Crusoe, our mission is to propel the availability of energy and intelligence. We are developing the engine that empowers individuals to pursue ambitious projects with AI, all while upholding standards of scale, speed, and sustainability.
Join us in revolutionizing the AI landscape with sustainable technology. Here, you will spearhead significant innovations, create real-world impact, and collaborate with a team that is defining the future of responsible cloud infrastructure.
Position Overview:
As a Staff Software Engineer on the Model LifeCycle team, you will be instrumental in developing a robust managed platform that covers the entire application development lifecycle, with a particular focus on integrating Machine Learning models, including Large Language Models (LLMs).
Your Responsibilities:
Enhance systems for fine-tuning large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and efficient scaling.
Design and maintain comprehensive training pipelines for Large Language Models.
Contribute to the development of distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling).
Build and maintain the infrastructure for agent execution.
Implement features for dataset, model, and experiment management, including versioning, lineage tracking, evaluation, and reproducible fine-tuning at scale.
Collaboration and Impact:
Collaborate closely with Principal Engineers, product teams, and platform teams to implement core abstractions and APIs.
Participate in architectural decisions regarding training runtimes, scheduling, storage, and model lifecycle management.
Engage actively with the open-source LLM community.
This role offers considerable ownership: you will be pivotal in designing and implementing core systems.

