About the job
At Databricks, we are dedicated to empowering data teams to tackle the world's most challenging problems, from detecting security threats to advancing cancer drug development. We achieve this by offering the premier data and AI platform, allowing our customers to concentrate on their mission-critical challenges.
The Mosaic AI organization helps companies build AI models and systems on their own data, with technologies that range from training large language models (LLMs) from scratch to advanced retrieval-augmented generation (RAG). We pride ourselves on pushing the boundaries of science and operationalizing our innovations. Mosaic AI believes that a company’s AI models hold intrinsic value, akin to any other core intellectual property, and that superior AI models should be accessible to all.
Job Overview
As a research engineer in the Scaling team, you will stay abreast of the latest advancements in deep learning and pioneer new methodologies that surpass the current state of the art. You will collaborate with a diverse team of researchers and engineers, sharing insights and expertise. Most importantly, you will be passionate about our customers, striving to ensure their success in implementing cutting-edge LLMs and AI systems by translating our scientific knowledge into practical applications.
Your Impact
- Enhance training performance through optimization techniques such as kernel fusion, mixed precision, memory layout optimization, tiling strategies, and tensorization tailored to training-specific patterns (see the first sketch after this list).
- Design, implement, and optimize high-performance GPU kernels for training workloads, including attention mechanisms, custom layers, gradient computations, and activation functions, targeting NVIDIA architectures (see the second sketch below).
- Build distributed training frameworks for large language models, incorporating parallelism strategies (data, tensor, pipeline, ZeRO-based) and optimized communication patterns for gradient synchronization and collective operations (see the third sketch below).
- Profile, debug, and optimize end-to-end training workflows to pinpoint and resolve performance bottlenecks, applying memory-optimization techniques such as activation checkpointing, gradient sharding, and mixed-precision training (see the fourth sketch below).
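To make the first bullet concrete, here is a minimal sketch of operator fusion, assuming a PyTorch-style stack (an assumption; the posting does not name specific tooling). `torch.compile` can fuse the bias add and GELU below into a single GPU kernel, removing an intermediate round trip through device memory:

```python
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Two pointwise ops that a fusing compiler can combine into one kernel,
    # avoiding materializing the intermediate (x + bias) tensor in memory.
    return torch.nn.functional.gelu(x + bias)

# torch.compile traces the function and, on CUDA devices, can emit a single
# fused kernel for the add + GELU instead of launching two separate ones.
fused_bias_gelu = torch.compile(bias_gelu)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
y = fused_bias_gelu(x, b)
```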
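For the custom-kernel work in the second bullet, here is a hedged sketch in Triton, a Python-embedded GPU kernel language and one plausible tool among several (raw CUDA is another). The kernel implements a hypothetical squared-ReLU activation, an illustrative stand-in for the custom activation functions the role mentions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def squared_relu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    relu = tl.maximum(x, 0.0)
    tl.store(out_ptr + offsets, relu * relu, mask=mask)

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Assumes a contiguous CUDA tensor; one kernel launch over a 1-D grid.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    squared_relu_kernel[grid](x, out, n, BLOCK_SIZE=1024)
    return out
```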
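For the third bullet, the core of data-parallel gradient synchronization can be sketched as a naive all-reduce over parameter gradients (illustrative code, assuming `torch.distributed` with a process group already initialized, e.g. via `torchrun`). Production frameworks such as DDP or ZeRO bucket gradients and overlap this communication with backward computation, but the averaging logic is the same:

```python
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module) -> None:
    # After backward(), every rank holds gradients for its local micro-batch.
    # Summing across ranks and dividing by world size averages the gradients,
    # keeping all model replicas in lockstep after the optimizer step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```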
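And for the memory techniques in the fourth bullet, a small sketch combining activation checkpointing with bf16 mixed-precision autocast (again an illustrative PyTorch example, not a prescribed implementation). Checkpointing trades recomputation for memory; bf16 autocast runs matmuls in reduced precision while weights and optimizer state stay in fp32:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Checkpointing discards the block's intermediate activations in the
        # forward pass and recomputes them during backward.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = Block(1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

opt.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()  # dummy loss for illustration
loss.backward()
opt.step()
```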

