About the job
At Magic, we are dedicated to creating safe artificial general intelligence (AGI) that propels humanity forward in tackling the most pressing global challenges. We believe the most effective route to safe AGI is automating research and code generation to improve models and resolve alignment issues more reliably than humans can alone. Our approach combines cutting-edge pre-training at scale, domain-specific reinforcement learning (RL), ultra-long context capabilities, and optimized inference-time compute.
Role Overview
In your role as a Software Engineer on the Pre-training Systems team, you will be responsible for designing and managing the distributed infrastructure necessary for training Magic’s long-context models at scale.
This position emphasizes large-scale model training on extensive GPU clusters. You will operate at the intersection of deep learning and distributed systems, ensuring that training runs are efficient, reliable, and reproducible at extreme scale.
Magic’s long-context models present complex systems challenges: sustained memory pressure, communication overhead across thousands of devices, long-running jobs that demand fault tolerance, and efficient sequence packing within hardware constraints. You will own the systems that keep large-scale pre-training both stable and fast.
Your Contributions
Scale distributed training across large GPU clusters, implementing data, tensor, and pipeline parallelism.
Optimize communication patterns and gradient synchronization.
Enhance checkpointing, fault tolerance, and job recovery mechanisms.
Profile and resolve performance bottlenecks across compute, networking, and storage.
Advance experiment reproducibility and orchestration workflows.
Boost hardware utilization and overall training throughput.
Collaborate with Kernel and Research teams to align model architecture with system capabilities.
Qualifications We Seek
Solid foundation in software engineering and distributed systems.
Experience with training large models in multi-node GPU environments.
In-depth understanding of parallelism techniques and performance trade-offs.
Experience in debugging cross-layer issues within production ML systems.
Demonstrated ownership mentality and capability to manage critical infrastructure.
Proven track record in enhancing the performance or reliability of large-scale systems.