About the job
At Orgvue, we are at the forefront of organizational design and planning software, harnessing the transformative power of data visualization and modeling to help organizations become more adaptable and high-performing. Our platform empowers HR, finance, and business leaders to make swift, informed workforce decisions in an ever-evolving landscape.
Trusted by some of the world's largest enterprises and renowned management consulting firms, Orgvue enables organizations to visualize and proactively shape their futures. Headquartered in London, we also have offices in Philadelphia, The Hague, Toronto, and Sydney.
We are looking for a Principal Site Reliability Engineer to join our team as a senior technical leader specializing in scaling and hardening our AWS- and Kubernetes-based infrastructure.
Role Overview
In this pivotal role, you will collaborate with product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale. This position marries hands-on technical proficiency with strategic foresight, enabling us to cultivate a world-class reliability culture and a strong engineering framework for growth. We seek an individual with robust technical skills, exceptional communication abilities, and a passion for cross-team collaboration.
Key Responsibilities
- Establish and uphold SLOs, SLIs, and error budgets across vital services
- Design and execute a comprehensive cloud infrastructure and tooling strategy
- Elevate SRE practices organization-wide
- Implement effective metrics, logs, and traces using our observability tools
- Lead the team in creating automated, self-healing systems
- Manage and refine our incident response protocols, including on-call practices and a post-mortem culture
- Mentor engineers throughout the organization on reliability best practices, operational readiness, and scalable infrastructure
- Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps methodologies
- Work closely with security, DevOps, and software teams to guarantee compliance, scalability, and operational excellence
- Assess and introduce tools, patterns, and practices that enhance the performance and reliability of our SaaS platform
Qualifications
- Proven experience leading SRE transformations
- Extensive hands-on expertise with Kubernetes (EKS preferred) in production settings
- Strong proficiency with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
- Expertise in Infrastructure as Code using tools such as Terraform, plus familiarity with GitOps workflows
- Solid background in observability: metrics, visualization, logging, and tracing
- Underst...

