About the job
Key Responsibilities
Cluster Operations & Management
Manage and maintain container clusters (Kubernetes, Docker) and open-source component clusters (Kafka, Redis, Elasticsearch) across business units.
Maintain the performance, scalability, and reliability of distributed systems.
Infrastructure Platform Development
Architect, develop, and enhance infrastructure operation platforms.
Create and maintain systems for infrastructure management, CI/CD pipelines, monitoring and alerting, and centralized logging.
Champion platform standardization and automation initiatives.
High Availability & Reliability
Maximize uptime for production services through proactive monitoring and effective incident response.
Continuously refine service architecture, deployment strategies, and operational processes.
Implement and uphold SLA/SLO frameworks and reliability engineering best practices.
Automation & Process Improvement
Lead the creation of automated operations and maintenance systems.
Develop self-service tools and workflows to enhance team productivity.
Establish best practices for infrastructure as code and configuration management.

