About the Role:
Production Engineer
The Production Engineer at Rubrik is essential for achieving operational excellence. This position involves managing alerts, addressing outages, and leading incident resolution as an Incident Manager. The ideal candidate will possess hands-on experience in maintaining highly available critical services across multi-cloud environments while continuously enhancing processes through automation and intelligent monitoring.
What You’ll Do:
- Become a vital part of a 24/7 Production Operations team dedicated to managing and supporting critical infrastructure and services in multi-cloud environments.
- Monitor staging and production environments to ensure optimal uptime and reliability.
- Implement and maintain comprehensive observability solutions for real-time monitoring, alerting, and metrics collection.
- Lead incident management by responding promptly to alerts and outages and coordinating teams toward timely resolution.
- Investigate recurring incidents to identify root causes, minimize toil, and enhance system resilience.
- Design and develop automation tools to proactively detect, triage, and remediate production issues.
- Maintain and update runbooks to facilitate incident response and address recurring issues.
- Exhibit strong decision-making skills under pressure, effectively managing critical situations with urgency and composure.

