About the job
Join Moonlake, a pioneering company harnessing AI to develop immersive world simulations.
Role Overview
Enhancing Training Efficiency
Implement data loaders, fusion techniques, and activation rematerialization (gradient checkpointing).
Optimize training with FSDP, ZeRO, tensor and pipeline parallelism, and NCCL tuning.
Improving GPU and Kernel Performance
Conduct Nsight profiling, develop Triton/CUDA kernels, and create fused operations.
Implement FlashAttention-style accelerations, sequence packing, and KV-cache optimizations.
Optimizing Inference
Focus on low-latency serving, continuous batching, and speculative decoding strategies.
Apply quantization methods (GPTQ/AWQ), distillation, and pruning techniques.
Infrastructure and Reliability
Manage SLURM/Kubernetes multi-node jobs and ensure checkpoint hygiene.
Maintain determinism and environment pinning, and handle GPU failures effectively.
Our dedicated team thrives on collaboration in our San Mateo office.

