
Software Engineer - Model APIs

Baseten | San Francisco
On-site | Full-time





Qualifications

  • Proven experience in software engineering, particularly in API development and optimization.

  • Strong understanding of distributed systems, model serving, and performance optimization techniques.

  • Proficiency in CUDA programming and experience with TensorRT or similar technologies.

  • Familiarity with benchmarking methodologies and performance analysis tools.

  • Excellent problem-solving skills and the ability to work effectively in a collaborative team environment.

  • Strong communication skills to articulate complex concepts to technical and non-technical stakeholders.

About the job

ABOUT BASETEN

At Baseten, we are at the forefront of AI innovation, providing critical inference solutions for leading AI companies like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. Our platform combines advanced AI research, adaptable infrastructure, and intuitive developer tools, empowering organizations to deploy state-of-the-art models effectively. With rapid growth and a recent $300M Series E funding round backed by top-tier investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we invite you to join our mission in building the platform of choice for engineers delivering AI products.

THE ROLE:

As a member of Baseten’s Model Performance (MP) team, you will play a pivotal role in ensuring our platform’s model APIs are not only fast and reliable but also cost-effective. Your primary focus will be on developing and optimizing the infrastructure behind our hosted API endpoints for cutting-edge open-source models. The role spans distributed systems, model serving, and developer experience. You will collaborate with a small, dynamic team at the intersection of product development, model performance, and infrastructure, defining how developers interact with AI models at scale.

RESPONSIBILITIES:

  • Design, develop, and maintain the Model APIs surface, focusing on advanced inference features such as structured outputs (JSON mode, grammar-constrained generation), tool/function calling, and multi-modal serving.

  • Profile and optimize TensorRT-LLM kernels, analyze CUDA kernel performance, create custom CUDA operators, and enhance memory allocation patterns for maximum efficiency across multi-GPU setups.

  • Implement performance improvements across various runtimes based on a deep understanding of their internals, including speculative decoding, guided generation for structured outputs, and custom scheduling algorithms for high-performance serving.

  • Develop robust benchmarking frameworks to evaluate real-world performance across diverse model architectures, batch sizes, sequence lengths, and hardware configurations.

  • Enhance performance across runtimes (e.g., TensorRT, TensorRT-LLM) through techniques such as speculative decoding, quantization, batching, and KV-cache reuse.

  • Integrate deep observability mechanisms (metrics, traces, logs) and establish repeatable benchmarks to assess speed, reliability, and quality.

