ML Model Serving Engineer

Sesame, San Francisco
On-site, Full-Time

Qualifications

Required Qualifications:

  • Proficient in a differentiable array computing framework, with a strong preference for PyTorch.

  • Expertise in optimizing machine learning models for high-throughput and low-latency serving.

  • Extensive systems programming experience, including familiarity with high-performance server systems and comfort with both VLLM internals and complex PyTorch codebases.

  • Significant performance engineering background, including experience in bottleneck analysis and profiling low-level systems code.

  • Committed to staying current with the latest advancements in model serving optimization.

Preferred Qualifications:

  • Understanding of high-performance LLM serving, with experience deploying VLLM and SGLang and working with their internals.

  • Experience with public cloud platforms such as GCP, AWS, or Azure.

  • Experience in deploying and scaling infrastructure for machine learning services.

About the job

About Sesame

At Sesame, we envision a transformative future where technology is seamlessly integrated into our lives, enabling computers to perceive, interact, and collaborate in ways that feel genuinely human. Our mission is to create innovative voice agents that become an integral part of daily experiences. Our talented team comprises pioneers from Oculus and Ubiquity6, alongside industry leaders from Meta, Google, and Apple, all bringing extensive expertise in both hardware and software. Join us in pioneering a world where computers are truly alive.

Key Responsibilities:

  • Enhance our model serving infrastructure, integrating a diverse range of LLM, speech, and vision models.

  • Collaborate with ML infrastructure and training engineers to develop a fast, cost-efficient, and reliable serving layer for our groundbreaking consumer product.

  • Adapt and extend existing LLM serving frameworks such as VLLM and SGLang, leveraging cutting-edge techniques for high-performance model serving.

  • Partner with the training team to uncover opportunities for accelerating model performance without compromising quality.

  • Implement strategies like in-flight batching, caching, and custom kernels to optimize inference speed.

  • Find ways to reduce model initialization time without degrading model quality.

About Sesame

Sesame is at the forefront of redefining human-computer interaction. With a pioneering spirit and an innovative team, we are determined to build the next generation of voice agents that will seamlessly integrate into everyday life, making technology more intuitive and accessible. Join us in this exciting journey to shape a future where computers are more than just tools—they are partners in our daily experiences.
