About the job
At Runway ML, we work at the intersection of art and science, harnessing AI to create simulations of the world. Our vision is that world models represent the cutting edge of artificial intelligence. We believe that language models alone fall short of addressing the most pressing challenges in robotics, healthcare, and scientific discovery. Genuine advancement requires models that learn from their interactions with the world, just as humans do, and this iterative learning process can be substantially accelerated through simulation rather than real-world experimentation.
World models pave the way for groundbreaking general-purpose simulations, revolutionizing storytelling, scientific exploration, and the pursuit of new horizons for humanity.
Our team is composed of imaginative, open-minded, compassionate, and driven individuals committed to making a difference. We aim to continuously achieve the remarkable, and our success hinges on cultivating an exceptional team. If you share this ambition, we invite you to connect with us.
Role Overview
We are seeking Research Engineers dedicated to enhancing the efficiency and speed of our world models without sacrificing their capabilities. Your responsibilities will include profiling, optimizing, and rearchitecting the systems that transform research concepts into scalable, real-time models, directly influencing what is computationally possible and the capabilities we can develop.
Key Responsibilities
- Optimize training throughput on large GPU clusters, improving Model FLOPs Utilization (MFU) through custom kernels, mixed-precision strategies (FP8, BF16), memory-efficient attention mechanisms, and activation checkpointing.
- Design and maintain distributed training infrastructure encompassing tensor parallelism, context parallelism, FSDP, and resilient multi-node configurations.
- Profile and accelerate inference pipelines for real-time multimodal generation, including CUDA graph capture, KV cache optimization, operator fusion, and latency reduction.
- Optimize and scale our training infrastructure to bolster efficiency and reliability.
- Contribute across the entire stack, from low-level kernel optimizations to high-level model design.
Qualifications
- 4+ years of experience in systems engineering, machine learning infrastructure, or performance optimization specifically for deep learning.
- Proficiency in GPU kernel development (CUDA, Triton, CUTLASS) and familiarity with distributed systems (NCCL, collective communication, model parallelism).
- Understanding of machine learning framework internals (PyTorch, TensorFlow, etc.) and hands-on experience with optimization techniques.
