About the job
Veeam is recognized as the premier Data and AI Trust Company, dedicated to helping organizations understand, secure, and strengthen their data and AI systems. As the leader in data resilience and security posture management, Veeam addresses the convergence of identity, data, security, and AI risk. Our headquarters are in Seattle, and we operate in over 30 countries, safeguarding the data of more than 550,000 customers globally who rely on Veeam to maintain business continuity. Join us as we advance together, fostering growth, learning, and making a significant impact for some of the world’s most renowned brands.
We are seeking a Senior Software Engineer - Reliability to take on a pivotal role as a hands-on technical leader within our Site Reliability Engineering (SRE) team. In this position, you will mentor senior engineers, influence product development, and ensure that our operational systems are designed for reliability, scalability, and observability from the ground up.
Your responsibilities will include driving strategic initiatives, mentoring others in SRE practices, and defining architectural best practices across our platform. This role is crucial for aligning teams, maintaining high standards, and scaling SRE principles globally within Veeam.
Your tasks will include:
- Reliability Engineering & Resilience
  - Design and enhance infrastructure to ensure high availability, fault tolerance, and scalability across public clouds, starting with Azure and planning expansion to other providers.
  - Establish and uphold Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to define and enforce reliability goals.
  - Lead incident response, conduct thorough analysis, facilitate blameless postmortems, and host sharing sessions so that lessons reach the whole engineering team and drive improvements across the socio-technical engineering ecosystem.
- Observability & Operational Excellence
  - Promote deep observability practices, ensuring telemetry, logs, and metrics are used effectively to improve our operational insight.

