About the job
At Magic, our mission is to create safe AGI that propels humanity forward in addressing the world’s most critical challenges. We believe that the key to achieving safe AGI lies in automating research and code generation to enhance models and resolve alignment issues more effectively than humans alone. Our unique approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time computation to realize this vision.
Role Overview
As a member of our Supercomputing Platform & Infrastructure team, you will design, build, and manage the large-scale GPU infrastructure that underpins Magic’s model training and inference.
A central part of this role is using Terraform-driven infrastructure as code to build and maintain our infrastructure, ensuring reproducibility, reliability, and operational clarity across clusters of thousands of GPUs.
Magic’s long-context models place sustained demands on compute, networking, and storage systems. The infrastructure must support long-running distributed jobs, high-throughput data movement, and stringent availability requirements, which calls for designs that are automated, observable, and resilient. You will own the systems and IaC foundations that make these capabilities possible.
This position can expand into broader responsibilities in supercomputing platform architecture, shaping how Magic scales GPU clusters and improves infrastructure reliability as model workloads grow.
Key Responsibilities
Design and manage large-scale GPU clusters for model training and inference.
Build and maintain infrastructure using Terraform across cloud and hybrid environments.
Develop modular, scalable IaC frameworks for provisioning compute, networking, and storage resources.
Improve deployment reproducibility, environment consistency, and operational safety.
Optimize networking and storage architectures for high-throughput AI workloads.
Automate fault detection and recovery mechanisms across distributed clusters.
Diagnose complex cross-layer issues involving hardware, drivers, networking, storage, operating systems, and cloud environments.
Improve observability, monitoring, and reliability of critical platform systems.
Qualifications
Strong foundation in systems engineering principles.
Extensive hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale implementations.