
Technical Staff Member - Inference & Reinforcement Learning Systems

Magic.dev, San Francisco
On-site, full-time



About the job

At Magic, we are driven by our mission to develop safe Artificial General Intelligence (AGI) that propels humanity forward in addressing the most critical challenges. We firmly believe that the future of safe AGI lies in automating research and code generation, allowing us to enhance models and tackle alignment issues more effectively than humans alone can manage. Our innovative approach combines cutting-edge pre-training, domain-specific reinforcement learning (RL), ultra-long context, and efficient inference-time computation to realize this vision.

Position Overview

As a Software Engineer within the Inference & RL Systems team, you will play a pivotal role in designing and managing the distributed systems that enable our models to function seamlessly in production, supporting extensive post-training workflows.

This position operates at the intersection of model execution and distributed infrastructure, focusing on systems that influence inference latency, throughput, stability, and the reliability of RL and post-training loops.

Our long-context models impose significant execution demands, including KV-cache scaling, managing memory constraints for lengthy sequences, batching strategies, long-horizon trajectory rollouts, and ensuring consistent throughput under real-world workloads. You will be responsible for the infrastructure that ensures both production inference and large-scale RL iterations are efficient and dependable.
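To give a sense of the execution demands above, here is a back-of-envelope KV-cache sizing sketch. The model dimensions used are illustrative assumptions for a generic transformer, not Magic's actual architecture:

```python
# Back-of-envelope KV-cache sizing for a long-context transformer.
# All dimensions below are illustrative assumptions, not Magic's models.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence: K and V (factor of 2) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 32-layer model with 8 KV heads of dim 128, fp16 cache:
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
million_ctx_gb = kv_cache_bytes(32, 8, 128, seq_len=1_000_000) / 1e9

print(per_token)       # bytes of cache per token
print(million_ctx_gb)  # GB of cache for a single 1M-token sequence
```

Even this modest hypothetical model accumulates roughly 131 GB of cache for a single million-token sequence, which is why batching strategy, cache eviction, and memory management dominate long-context serving design.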

Key Responsibilities

  • Craft and scale high-performance inference serving systems.

  • Optimize KV-cache management, batching methods, and scheduling processes.

  • Enhance throughput and latency for long-context tasks.

  • Develop and sustain distributed RL and post-training infrastructure.

  • Boost reliability across rollout, evaluation, and reward pipelines.

  • Automate fault detection and recovery mechanisms for serving and RL systems.

  • Analyze and eliminate performance bottlenecks across GPU, networking, and storage components.

  • Collaborate with Kernel and Research teams to ensure alignment between execution systems and model architecture.

Qualifications

  • Solid foundation in software engineering and distributed systems.

  • Proven experience in building or managing large-scale inference or training systems.

  • In-depth understanding of GPU execution constraints and memory trade-offs.

  • Experience troubleshooting performance issues in production machine learning systems.

  • Capability to analyze system-level trade-offs between latency, throughput, and cost.

About Magic.dev

Magic.dev is at the forefront of AGI research, dedicated to creating safe AI solutions that significantly enhance human progress on pressing global issues. By leveraging advanced automated research techniques, we aim to redefine the capabilities of AI in a safe and responsible manner, making meaningful contributions to society.
