About the job
- Manage and operate a Kubernetes-based platform.
  - Manage clusters and nodes day to day, including upgrades, patches, node cordon/drain, and scaling.
  - Oversee persistent storage solutions (e.g., CSI/Longhorn) with basic capacity planning.
- Support reliability and Service Level Objectives (SLOs).
  - Help define and monitor Service Level Indicators (SLIs) and SLOs.
  - Track error budgets and report incidents and trends back to the team.
- Engage in observability and incident management.
  - Use and configure monitoring and logging systems (dashboards, alerts).
  - Participate in the on-call rotation: handle alerts and resolve incidents based on runbooks.
- Facilitate automation and manage runbooks.
  - Support deployment and configuration automation (CI/CD, Git-based processes).
  - Create and maintain runbooks and operational documentation.
- Collaborate across organizational units.
  - Work daily with development, operations, and business stakeholders.
  - Suggest improvements to processes and platform reliability.

