About ClickHouse
Ranked among the 2025 Forbes Cloud 100, ClickHouse stands as a leading innovator in the private cloud sector. With a customer base exceeding 3,000 and an annual recurring revenue (ARR) growth of over 250% year-on-year, we excel in real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400 million Series D funding round underscores our rapid growth and momentum. In the past three months alone, renowned clients like Capital One, Lovable, Decagon, Polymarket, and Airwallex have adopted or expanded their use of our platform. They join industry giants such as Meta, Cursor, Sony, and Tesla, who rely on our technology.
We invite you to join us on our mission to revolutionize the way organizations harness their data!
About the Role
As we aim to provide our customers with dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will lead initiatives to ensure the reliability, availability, scalability, and performance of our cloud infrastructure. Collaborating with teams across Control Plane, Data Plane, Core, Security, Support, and Operations, you will guide the design and implementation of scalable, secure, and resilient distributed systems. You will also oversee incident management, conduct post-mortem analyses, and drive continuous improvements in our Cloud services. Using your software engineering skills, you will develop platforms and tools that improve operational and engineering efficiency in ClickHouse Cloud. This position offers a unique chance to contribute significantly to the high performance, elasticity, and limitless scale of ClickHouse Cloud.
What Will You Do?
- Work collaboratively with various engineering teams at ClickHouse to design and implement scalable, secure, and highly available systems.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure comprehensive monitoring and alerting for all infrastructure components in ClickHouse Cloud, enabling timely incident detection and resolution.
- Refine incident response processes and conduct post-mortem analyses for outages, partnering with the support team to communicate effectively with affected customers.
- Continuously enhance the reliability and performance of our ClickHouse services.
- Plan and lead Chaos Engineering initiatives to identify potential vulnerabilities.

