
Staff Software Engineer - GenAI Performance and Kernel

Databricks · San Francisco, California
On-site · Full-time · $190.9K/yr - $232.8K/yr




Experience Level

Mid to Senior

Qualifications

What We Look For

  • BS/MS/PhD in Computer Science or a related field.
  • Substantial hands-on experience writing and tuning compute kernels (using CUDA, Triton, OpenCL, LLVM IR, assembly, or similar technologies) for machine learning workloads.
  • In-depth understanding of GPU and accelerator architecture, including warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, and SM occupancy.
  • Experience with advanced optimization techniques, including tiling, blocking, software pipelining, vectorization, fusion, loop transformations, and auto-tuning.
  • Familiarity with machine learning-specific kernel libraries (cuBLAS, cuDNN, etc.) is preferred.
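To give a concrete flavor of the tiling and blocking techniques named above, here is a minimal illustrative sketch (not Databricks code; `matmul_naive` and `matmul_tiled` are hypothetical names). It shows cache blocking for a matrix multiply in plain Python, the same idea GPU kernels apply when staging tiles of the operands in shared memory:

```python
def matmul_naive(a, b):
    """Reference triple-loop matmul: a is m x k, b is k x n (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

def matmul_tiled(a, b, tile=4):
    """Same result, but iterates over tile x tile blocks so each block of
    a and b is reused while it is hot in cache (or, on a GPU, resident in
    shared memory / registers)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        aip = a[i][p]  # loaded once, reused across the j loop
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aip * b[p][j]
    return c
```

In a real kernel the tile size is chosen (often auto-tuned) to match shared-memory capacity and occupancy targets rather than fixed at 4.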

About the job

P-1285

About This Role

Join our dynamic team at Databricks as a Staff Software Engineer specializing in GenAI Performance and Kernel. In this pivotal role, you will design, implement, and optimize the high-performance GPU kernels that drive our GenAI inference stack. You will lead the development of finely tuned, low-level compute paths that balance hardware efficiency with versatility, while mentoring fellow engineers in the intricacies of kernel-level performance engineering. Collaborating closely with machine learning researchers, systems engineers, and product teams, you will push the frontier of inference performance at scale.

What You Will Do

  • Lead the design, implementation, benchmarking, and maintenance of essential compute kernels (such as attention, MLP, softmax, layernorm, memory management) tailored for diverse hardware backends (GPU, accelerators).
  • Steer the performance roadmap for kernel-level enhancements, focusing on areas like vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, and auto-tuning.
  • Integrate kernel optimizations seamlessly with higher-level machine learning systems.
  • Develop and uphold profiling, instrumentation, and verification tools to identify correctness, performance regressions, numerical discrepancies, and hardware utilization inefficiencies.
  • Conduct performance investigations and root-cause analyses to address inference bottlenecks, such as memory bandwidth, cache contention, kernel launch overhead, and tensor fragmentation.
  • Create coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend compatibility, and maintainability.
  • Influence architectural decisions to enhance kernel efficiency (including memory layout, dataflow scheduling, and kernel fusion boundaries).
  • Guide and mentor fellow engineers focused on lower-level performance, conducting code reviews and establishing best practices.
  • Collaborate with infrastructure, tooling, and machine learning teams to implement kernel-level optimizations in production and assess their impacts.
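As an illustrative sketch of the verification work described above (not Databricks tooling; all names here are hypothetical), consider checking a softmax kernel, where reduced-precision implementations commonly diverge from a reference unless the row max is subtracted first:

```python
import math

def softmax_naive(xs):
    """Direct definition; math.exp overflows for large inputs."""
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_stable(xs):
    """Subtract the row max first - the numerically safe form
    production kernels use to avoid overflow."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def max_abs_diff(a, b):
    """Simple elementwise tolerance check a verification harness might run
    between a candidate kernel's output and a reference implementation."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A real harness would compare kernel outputs against a higher-precision reference across many shapes and dtypes, and flag regressions beyond a per-operator tolerance.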

About Databricks

Databricks is at the forefront of innovation, enabling organizations to harness the power of data and artificial intelligence. Our cutting-edge platform integrates data engineering, machine learning, and analytics, empowering teams to collaborate and drive transformational results.
