About the job
Location: Cambridge, UK. Candidates should live within commuting distance, or be willing to relocate, so that they can attend onsite meetings as needed.
Role overview
Jagex seeks a Senior Cloud Reliability Engineer to help keep RuneScape’s services reliable, scalable, and responsive for players worldwide. This position sits within the Cloud Tech division and collaborates with Game, Central Tech, and Cloud Platform teams. The main focus is to drive reliability, observability, automation, and cloud-native practices across a hybrid-cloud environment.
This role combines hands-on production engineering with architectural decision-making. Projects include modernizing services, migrating workloads, and improving systems that directly affect the player experience. The team is committed to delivering reliable live services at scale.
What you will do
- Collaborate with game and development teams to move services toward cloud-native architectures, increasing resilience, security, and cost-effectiveness in live environments.
- Support migration of workloads from managed VPS environments to Jagex’s cloud platform, with a focus on modernization and uptime.
- Develop and refine SLIs, SLOs, and error-budget frameworks to measure and communicate service reliability across teams.
- Design and improve observability and alerting for logs, metrics, and traces, enabling faster detection and resolution of issues.
- Automate operational tasks such as scaling, failover, and deployments. Build self-healing systems to reduce manual intervention and improve recovery.
- Drive reliability improvements in Linux-based production systems, reusable Infrastructure as Code modules, and team codebases. Contribute to raising engineering standards within Cloud Tech.
Requirements
- Experience ensuring reliability for large-scale, internet-facing production services.
- Strong knowledge of AWS services, including VPC, EC2, ECS/EKS, ELB, ECR, Route 53, KMS, IAM, and Systems Manager.
- Background in cloud-native design, workload modernization, and Infrastructure as Code (IaC).
- Hands-on experience with SLIs, SLOs, incident management, root cause analysis, and resilient system architecture.
- Practical experience with Debian-based Linux environments, managing VM fleets, and configuration management tools.
- Familiarity with observability platforms, CI/CD practices, containerization, and programming or scripting in Python or Java.
Benefits
- Comprehensive private healthcare, including a dental plan
- Discretionary annual performance bonuses
- Pension contributions of at least 6%
- Life insurance coverage
- Enhanced family leave policies

