About the job
About Us
Braintrust is at the forefront of AI observability, seamlessly integrating evaluations and observability into a single workflow. Our platform empowers innovators by providing them with the critical insights needed to understand AI performance in production environments and the tools required to enhance it.
Recognized by leading companies such as Notion, Stripe, Zapier, Vercel, and Ramp, Braintrust enables teams to compare AI models, test prompts, and detect regressions, transforming production data into superior AI with each iteration.
Role Overview
We are seeking a talented Cloud Infrastructure Engineer to join our team and contribute to the development of a robust and scalable infrastructure. You will provide developers with a premium platform to deploy code efficiently and confidently. Your role will involve leading initiatives across Terraform, Kubernetes, CI/CD, observability, and support, significantly impacting Braintrust's internal operations and the self-hosted experiences of our customers.
This position is pivotal as you will manage our AWS environment while assisting customers in deploying our infrastructure on AWS, Azure, and GCP.
Your Responsibilities
Develop and maintain Terraform modules for both internal infrastructure and customer deployments.
Engage directly with customers via Slack to assist with self-hosting and troubleshoot infrastructure challenges, and build tools that simplify the support process.
Take ownership of our CI/CD pipeline, aiming to reduce build times, enhance failure visibility, and facilitate safer, quicker releases.
Centralize and scale observability through logs, metrics, dashboards, and alerts.
Collaborate with engineering teams to create and enhance a secure, developer-friendly infrastructure platform.
Support multi-cloud deployments, primarily on AWS, while extending support to Azure and GCP for our enterprise customers.
Implement tools and automation to bolster deployment, rollback, and infrastructure reliability.
Ideal Candidate Profile
A minimum of 5 years of experience in DevOps, SRE, or Infrastructure Engineering roles.
In-depth knowledge of Terraform and experience with at least one major cloud provider, preferably AWS.
Proficiency in Kubernetes, including deploying, debugging, and scaling real workloads.
Strong programming skills in languages such as Python, TypeScript, or Go.
Experience in supporting production systems and managing incidents effectively.
Comfortable working closely with customers in a support or deployment capacity.
Bonus: Familiarity with monitoring and logging tools, as well as knowledge of security best practices.

