About the job
Air Apps builds technology to help people plan, work, and live better. Founded in Lisbon in 2018 and still family-led, the company has expanded to San Francisco and now London. It remains self-funded and has reached over 100 million downloads.
Every day, teams at Air Apps challenge assumptions and develop AI-powered products that make a difference for users worldwide. The company values creativity and aims to improve how people manage their resources and to have a positive impact on their lives.
Role Overview
The Site Reliability Engineer (SRE) will focus on keeping Air Apps' systems reliable, available, and scalable. This role connects software development and operations, using automation, monitoring, and performance tuning to reduce downtime and strengthen system resilience.
This is a fully onsite position based in the London Metropolitan Area. Air Apps will consider relocation support for the right candidate. The SRE will work closely with cross-functional teams in a busy office setting.
What You Will Do
- Design and implement systems that are scalable, reliable, and fault-tolerant across cloud platforms.
- Develop and maintain observability tools for monitoring, logging, and alerting (such as Prometheus, Grafana, Datadog, ELK).
- Automate infrastructure provisioning and deployment using Infrastructure as Code tools such as Terraform or CloudFormation, and automate incident response workflows.
- Improve system performance, scalability, and incident response processes to maximize uptime.
- Work with development and DevOps teams to strengthen system designs for reliability.
- Conduct root cause analysis and implement steps to prevent future failures.
- Design and maintain strategies for load balancing, failover, and disaster recovery to ensure high availability.

