
Performance Engineer - Member of Technical Staff, Kernel Engineering

Inferact · San Francisco
Remote · Full-time · $200K/yr - $400K/yr





Skills and Qualifications

Minimum qualifications:
- Bachelor's degree in computer science, engineering, or a related field, or equivalent practical experience.
- Extensive experience writing CUDA kernels or working with similar technologies (CuTeDSL, Triton, TileLang, Pallas).
- In-depth knowledge of GPU architecture, including memory hierarchy, warp scheduling, tiling, and tensor cores.
- Proficiency in C++ and Python, with a proven track record of crafting high-performance code.
- Familiarity with profiling tools (Nsight, rocprof) and performance optimization methods.
- A strong passion for benchmarking and for compounding incremental performance improvements.

Preferred qualifications:
- Experience with ML-specific kernel optimization (FlashAttention, fused kernels).
- Understanding of quantization techniques (INT8, FP8, mixed precision).
- Exposure to multiple accelerator platforms (NVIDIA, AMD, TPU, Intel).
- Knowledge of compiler technologies (LLVM, MLIR, XLA).

Bonus qualifications:
- Contributions to vLLM or similar inference engines.
- Active participation in open-source GPU, ML systems, or compiler optimization projects.
- Authored in-depth technical articles on GPU optimization.

About the job

At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, making inference dramatically faster and cheaper. Our founders, the original creators of vLLM, have spent years bridging the gap between advanced models and cutting-edge hardware.

About the Role

We are seeking a skilled performance engineer dedicated to maximizing the computational efficiency of modern accelerators. In this role, you'll develop kernels and implement low-level optimizations that position vLLM as the fastest inference engine in the world. Your work will be pivotal: your code will run across a broad spectrum of hardware accelerators, from NVIDIA GPUs to the newest silicon. You'll collaborate closely with hardware vendors to ensure we fully leverage the capabilities of each new generation of chips.

About Inferact

Inferact is dedicated to revolutionizing AI inference through vLLM, aiming to make it the fastest and most affordable engine available. Leveraging years of expertise from its founders and core maintainers, Inferact operates at the crucial intersection of advanced AI models and state-of-the-art hardware.
