About the job
P-1285
About This Role
Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with external serving frameworks.
What You Will Do
- Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference.
- Work closely with researchers to integrate new model architectures and features (such as sparsity, activation compression, and mixture-of-experts) into the engine.
- Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
- Build and maintain instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations.
- Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads.
- Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including support for A/B testing, rollbacks, and model versioning.
- Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective cross-node orchestration, load balancing, and minimal communication overhead.
- Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals.
- Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects.
What We Look For
- A BS/MS/PhD in Computer Science or a related discipline.
- A solid software engineering background with 6+ years of experience in performance-critical systems.
- A proven ability to own complex system components and influence architectural decisions from conception to execution.
- A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations.
- Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.).
- A strong foundation in distributed systems design, including RPC frameworks, queuing, request batching, sharding, and memory partitioning.
- Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).

