About Us
At Perlego, we are dedicated to revolutionizing education by making it accessible to everyone. In this digital era, we believe that learning should be available to all, anytime, anywhere. Knowledge should not be confined by exorbitant costs.
Over the last seven years, we have focused on giving students across the UK and Europe access to high-quality books. We aim to expand globally, particularly into the US market, and to build a platform that goes beyond traditional books by helping students study more efficiently and effectively.
Your Role
We are looking for a seasoned Cloud Infrastructure Engineer with extensive experience in AWS services and monitoring tools. In this critical role, you will be responsible for the availability and reliability of our services. You will address issues promptly, resolve incidents autonomously, and thrive in a dynamic environment.
Key Responsibilities:
Cloud Infrastructure Management:
- Administer and support our AWS infrastructure, emphasizing scalability, security, and reliability.
- Manage deployments and oversee CI/CD pipelines for both containerized (Docker/ECS) and serverless (AWS Lambda) applications.
- Implement effective backup and recovery strategies to minimize downtime.
- Oversee operational and analytical data stores, including Aurora MySQL, DynamoDB, and Databricks.
Monitoring & Incident Management:
- Use tools such as Prometheus, Grafana, and AWS CloudWatch to monitor platform activity.
- Swiftly respond to alerts and incidents, independently resolving issues to maintain service uptime.
- Conduct post-incident reviews and enhance system resiliency through automation and monitoring improvements.
- Analyze network activity using AWS Security Hub and Cloudflare.
Collaboration & Communication:
- Collaborate with cross-functional teams to implement platform enhancements.
- Make swift decisions independently, especially when managing service incidents outside regular business hours.
- Support platform security and adhere to best practices for cloud security and compliance.
Continuous Improvement:
- Automate repetitive processes to minimize human error and enhance efficiency.
- Continuously refine monitoring systems to ensure robust early detection and resolution capabilities.
- Identify and address potential performance bottlenecks.

