Technical Staff Member - Supercomputing Platform & Infrastructure

magic.dev, San Francisco
On-site, Full-time

About the job

At Magic, our mission is to create safe AGI that propels humanity forward in addressing the world’s most critical challenges. We believe that the key to achieving safe AGI lies in automating research and code generation to enhance models and resolve alignment issues more effectively than humans alone. Our unique approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time computation to realize this vision.

Role Overview

As a vital member of our Supercomputing Platform & Infrastructure team, you will be instrumental in designing, constructing, and managing the extensive GPU infrastructure that underpins Magic’s model training and inference processes.

A key part of your role will be building and maintaining this infrastructure with Terraform, using infrastructure-as-code (IaC) practices to ensure reproducibility, reliability, and operational clarity across clusters comprising thousands of GPUs.

Magic’s long-context models exert continuous demands on compute, networking, and storage systems. The infrastructure must support long-running distributed jobs, high-throughput data movement, and stringent availability requirements, necessitating designs that are automated, observable, and resilient. You will take ownership of the systems and IaC foundations that facilitate these capabilities.

This position has the potential to expand into broader responsibilities encompassing supercomputing platform architecture, influencing how Magic scales GPU clusters and enhances infrastructure reliability as model workloads expand.

Key Responsibilities

  • Design and manage large-scale GPU clusters for model training and inference.

  • Build and maintain infrastructure with Terraform across cloud and hybrid environments.

  • Develop modular, scalable IaC frameworks for provisioning compute, networking, and storage resources.

  • Enhance deployment reproducibility, maintain environment consistency, and ensure operational safety.

  • Optimize networking and storage architectures for high-throughput AI workloads.

  • Automate fault detection and recovery mechanisms across distributed clusters.

  • Diagnose complex cross-layer issues involving hardware, drivers, networking, storage, operating systems, and cloud environments.

  • Enhance observability, monitoring, and reliability of essential platform systems.

Qualifications

  • Strong foundation in systems engineering principles.

  • Extensive hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale implementations.

About magic.dev

Magic.dev is dedicated to building safe AGI that accelerates human progress in solving the world's most pressing challenges. Our innovative approach integrates advanced techniques to ensure the reliability and effectiveness of AI development.
