About the job
ABOUT BASETEN
At Baseten, we are at the forefront of AI innovation, providing critical inference solutions for leading AI companies like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. Our platform combines advanced AI research, adaptable infrastructure, and intuitive developer tools, empowering organizations to deploy state-of-the-art models effectively. With rapid growth and a recent $300M Series E funding round backed by top-tier investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we invite you to join us in building the platform of choice for engineers delivering AI products.
THE ROLE:
As a member of Baseten’s Model Performance (MP) team, you will play a pivotal role in ensuring our platform’s model APIs are fast, reliable, and cost-effective. Your primary focus will be developing and optimizing the infrastructure that supports our hosted API endpoints for cutting-edge open-source models. The role spans distributed systems, model serving, and developer experience. You will collaborate with a small, dynamic team at the intersection of product development, model performance, and infrastructure, defining how developers interact with AI models at scale.
RESPONSIBILITIES:
Design, develop, and maintain the Model APIs surface, focusing on advanced inference features such as structured outputs (JSON mode, grammar-constrained generation), tool/function calling, and multi-modal serving (a JSON-mode request is sketched in the first example after this list).
Profile and optimize TensorRT-LLM kernels, analyze CUDA kernel performance, create custom CUDA operators, and improve memory allocation patterns for maximum efficiency across multi-GPU setups (a minimal kernel-timing sketch appears after this list).
Implement performance improvements across various runtimes, drawing on a deep understanding of their internals: speculative decoding, guided generation for structured outputs, and custom scheduling algorithms for high-performance serving.
Develop robust benchmarking frameworks to evaluate real-world performance across diverse model architectures, batch sizes, sequence lengths, and hardware configurations (a minimal harness is sketched in the last example after this list).
Enhance performance across runtimes (e.g., TensorRT, TensorRT-LLM) through techniques such as speculative decoding, quantization, batching, and KV-cache reuse.
Integrate deep observability mechanisms (metrics, traces, logs) and establish repeatable benchmarks to assess speed, reliability, and quality.
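To ground the structured-outputs responsibility, here is a minimal sketch of a JSON-mode request against an OpenAI-compatible chat completions endpoint. The base URL, model slug, and API key are illustrative placeholders, not documented Baseten values.

```python
# Minimal sketch: requesting JSON-mode (structured) output from an
# OpenAI-compatible model API. Endpoint, key, and model are placeholders.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-host/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                        # placeholder credential
)

response = client.chat.completions.create(
    model="example-open-source-model",  # placeholder model slug
    messages=[
        {"role": "system", "content": "Reply only with valid JSON."},
        {"role": "user", "content": 'List three GPUs as {"gpus": [...]}.'},
    ],
    # JSON mode constrains decoding so the completion parses as JSON;
    # grammar-constrained generation generalizes this to arbitrary schemas.
    response_format={"type": "json_object"},
)

print(json.loads(response.choices[0].message.content))
```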
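On the kernel-performance side, a common first step before reaching for Nsight-level profiling is timing GPU work with CUDA events. The sketch below, assuming PyTorch and an available GPU, times a half-precision matmul; shapes and iteration counts are arbitrary.

```python
# Minimal sketch: timing a GPU op with CUDA events before deeper profiling.
import torch

assert torch.cuda.is_available()
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for _ in range(3):  # warm-up to exclude one-time setup costs
    torch.matmul(a, b)

start.record()
for _ in range(10):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()  # kernel launches are async; wait before reading timers

print(f"avg matmul time: {start.elapsed_time(end) / 10:.3f} ms")
```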
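And for the benchmarking responsibilities, a minimal streaming harness can capture the two numbers that matter most for serving: time-to-first-token and decode throughput. This sketch reuses the placeholder client above and approximates tokens by streamed chunks.

```python
# Minimal sketch: measuring time-to-first-token (TTFT) and decode throughput
# over a streaming completion. Not a production harness.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-host/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                        # placeholder credential
)

def benchmark(prompt: str, max_tokens: int = 256) -> None:
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model="example-open-source-model",  # placeholder model slug
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1  # one streamed chunk approximates one token
    total = time.perf_counter() - start
    ttft = (first_token_at or start) - start
    decode = max(total - ttft, 1e-9)
    print(f"TTFT {ttft:.3f}s | total {total:.3f}s | ~{n_chunks / decode:.1f} tok/s")

benchmark("Summarize speculative decoding in two sentences.")
```

A real harness would sweep batch sizes, sequence lengths, and hardware configurations and aggregate over many trials, but the measurement core looks like this.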