ABOUT BASETEN
Baseten powers mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We combine applied AI research, flexible infrastructure, and intuitive developer tools so that companies at the leading edge of AI can deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are expanding rapidly. Join our team and help build the platform engineers rely on to launch AI products.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key: you will work closely with our users to understand their challenges in operationalizing machine learning, help them onboard onto our platform, and use these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
- Innovative multi-cloud capacity management
- Optimizing inference on B200 GPUs
- Implementing multi-node inference
- Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
- Design and maintain scalable infrastructure to support the deployment and operational needs of machine learning models.
- Establish standards and best practices to enhance reliability and performance across the infrastructure.
- Proactively identify and resolve reliability issues using monitoring and alerting systems.
- Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
- Create automation scripts to streamline processes and reduce manual intervention.

