Staff Site Reliability Engineer - Incident Management & Reliability

Confluent
Remote, Ontario, Canada
Remote, Full-time

Experience Level

Mid to Senior

Qualifications

What You Will Do:

Conduct analyses of systemic failure patterns and devise reliability enhancements to avert future incidents.

Manage configurations, workflows, and integrations for tools like Rootly, PagerDuty, Jira, Confluence, and Slack.

Establish and uphold SLO/SLA frameworks, utilizing error budgets to direct reliability investments.

Lead the development of standards and practices to continuously improve incident response across all engineering teams.

Review and refine customer-facing incident documentation (CRCAs) for clarity and quality.

Create and implement training programs, mentoring teams through post-incident analyses.

Collaborate with engineering leaders to enhance overall reliability and performance.

About the job

At Confluent, we are not just advancing technology; we are revolutionizing the flow of data and its potential applications. Our platform empowers businesses to utilize data in real-time, enabling them to adapt swiftly, innovate intelligently, and offer experiences that resonate with the fast-paced world around them.

We seek individuals who thrive in collaborative environments, who are unafraid to pose challenging questions, provide constructive feedback, and support one another. Our team is built on a foundation of curiosity and collective ambition, where egos take a backseat to team efforts.

Join us at Confluent as we unite as one team on our journey to enhance data streaming.

About the Role:

As a Staff Site Reliability Engineer specializing in Incident Management, you will play a crucial role in maintaining the reliability of Confluent Cloud, which processes millions of events per second across multiple cloud platforms like AWS, GCP, and Azure. You will leverage your deep systems thinking to preemptively address incidents that could disrupt our multi-cloud streaming services.

Your work will blend technical expertise with strategic program ownership, dedicating about 75% of your time to engineering tasks such as automating processes, refining tools, analyzing failure patterns, and enhancing reliability. The remaining 25% will focus on coaching and collaboration, guiding teams through post-incident reviews and refining our incident response methodologies.

You will be part of a global team that provides continuous support, keeping the workload sustainable through seamless handoffs. This position falls within the Cloud Architecture and Reliability - Supportability division, a team committed to establishing and upholding reliability standards across our engineering efforts.

About Confluent

Confluent is at the forefront of data streaming technology, transforming how data is managed and utilized globally. Our innovative platform enables businesses to harness data in real-time, fostering rapid responses and intelligent solutions that align with dynamic market demands.
