About the job
About ClickHouse
Recognized on the 2025 Forbes Cloud 100 list, ClickHouse is one of the leading privately held cloud companies. With a rapidly expanding base of more than 3,000 customers and annual recurring revenue (ARR) growth of over 250% year-on-year, ClickHouse is at the forefront of real-time analytics, data warehousing, observability, and AI workloads.
Our recent $400M Series D financing round validates our sustained momentum. Notable clients such as Capital One, Lovable, Decagon, Polymarket, and Airwallex have recently adopted or expanded their use of our platform, joining a prestigious roster of AI pioneers and global brands including Meta, Cursor, Sony, and Tesla.
Join us in our mission to revolutionize the way companies leverage data!
About the Role
To strengthen our commitment to delivering dependable and secure services, we are expanding our Site Reliability Engineering team. In this role, you will lead initiatives to maintain and improve the reliability, availability, scalability, and performance of our cloud infrastructure. You will collaborate with teams across the company, including Control Plane, Data Plane, Core, Security, Support, and Operations, to design and implement robust, secure, and highly available distributed systems. You will own incident management and response processes, conduct blameless postmortems, and drive continuous improvements in our Cloud services. Your software engineering expertise will be vital in developing tools and platforms that improve operational and engineering efficiency within ClickHouse Cloud. This is a unique opportunity to make a substantial impact on our high-performance, elastic ClickHouse Cloud.
Your Responsibilities
- Collaborate with diverse engineering teams at ClickHouse to architect and implement scalable, secure, and high-availability systems.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure all infrastructure components within ClickHouse Cloud, including Data Plane, Control Plane, and ClickHouse Core, have effective monitoring and alerting systems in place for timely incident detection and resolution.
- Refine incident response processes and postmortem analyses for outages in ClickHouse Cloud, including communication with impacted customers through the support team.
- Continuously enhance the reliability and performance of ClickHouse services.

