Qualifications
Required Qualifications
- Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
- In-depth understanding of transformer architectures and their derivatives.
- Proficient programming skills in Python, with a strong background in PyTorch internals.
- Experience with LLM inference systems (e.g., vLLM, TensorRT-LLM, SGLang, TGI).
- Ability to interpret and implement model architectures and inference techniques as presented in academic papers.
- Proven capability to produce high-performance, maintainable code and troubleshoot complex machine learning codebases.

Preferred Qualifications
- Comprehensive knowledge of KV-cache memory management, prefix caching, and hybrid model serving.
- Familiarity with reinforcement learning frameworks and algorithms for large language models.
- Experience in multimodal inference across various media types (audio, image, video, text).
- Previous contributions to open-source machine learning or systems infrastructure projects.

Additionally, bonus points if you have:
- Successfully implemented core features in vLLM or other inference engine projects.
- Contributed to vLLM integrations (e.g., verl, OpenRLHF, Unsloth, LlamaFactory).
- Authored widely shared technical blogs or side projects focusing on vLLM or LLM inference.
About the job
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.
Role Overview
We are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.
About Inferact
Inferact is dedicated to advancing the field of artificial intelligence through innovative solutions in inference technology. Our team, composed of the original architects of vLLM, is committed to shaping the future of AI by creating tools that make inference faster and more cost-effective.