
Software Engineer - GenAI Inference at Databricks | San Francisco

Databricks · San Francisco, California
On-site · Full-time · $142.2K/yr – $204.6K/yr



Qualifications

What We Look For

  • PhD, Master's, or Bachelor's degree in Computer Science or a related field.
  • A robust software engineering background with at least 3 years of experience in performance-critical systems.
  • A solid grasp of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations.
  • Hands-on experience with CUDA, GPU programming, and key libraries such as cuBLAS, cuDNN, and NCCL.
  • Comfort designing and operating distributed systems, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning.
  • Proven ability to identify and resolve performance bottlenecks across layers, including kernel, memory, networking, and scheduling.
  • Experience building instrumentation, tracing, and profiling tools for ML models.
  • Ability to work closely with ML researchers to translate innovative model ideas into production-ready systems.

About This Role

Join Databricks as a Software Engineer focused on GenAI inference, where you will play a pivotal role in designing, developing, and enhancing the inference engine that drives our Foundation Model API. Collaborating at the intersection of research and production, you will ensure our large language model (LLM) serving systems are optimized for speed, scalability, and efficiency. Your contributions will span the entire GenAI inference stack, from kernels and runtimes to orchestration and memory management.

What You Will Do

  • Participate in the design and implementation of the inference engine, collaborating on a model-serving stack tailored for large-scale LLM inference.
  • Work closely with researchers to integrate new model architectures or features such as sparsity, activation compression, and mixture-of-experts into the engine.
  • Optimize latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators.
  • Build and maintain tools for instrumentation, profiling, and tracing to identify bottlenecks and inform optimization efforts.
  • Develop scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads.
  • Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including A/B launches, rollback, and model versioning.
  • Integrate with federated and distributed inference infrastructure, orchestrating across nodes, balancing load, and managing communication overhead.
  • Engage in cross-functional collaboration with platform engineers, cloud infrastructure, and security/compliance teams.
  • Document and share insights, contributing to internal best practices and open-source initiatives as appropriate.

About Databricks

Databricks is a leader in data and AI, providing a unified platform for data engineering, machine learning, and analytics. Our innovative solutions empower organizations to leverage their data for transformative insights and decision-making.
