About the job
Join Growe, a premier business advisory and services group specializing in iGaming and Entertainment. We craft effective strategies and scalable solutions that empower businesses to navigate the rapidly changing landscape, seize new opportunities, expand into new markets, and achieve sustainable growth.
Ideal candidates will:
Ensure the availability, performance, and scalability of infrastructure and services through vigilant monitoring, automation, and adherence to operational best practices.
Lead incident response efforts, conduct thorough root cause analyses, and implement both immediate recovery measures and long-term fixes.
Manage infrastructure through Terraform, Terragrunt, and other automation tools to ensure consistency and repeatability.
Implement and maintain robust metrics, logging, and tracing solutions using tools like Prometheus, Grafana, Loki, VictoriaMetrics, and CloudWatch to enhance system visibility.
Identify performance bottlenecks and fine-tune systems and infrastructure to resolve them.
Monitor resource usage, forecast growth, and develop effective scaling strategies.
Integrate security best practices into Infrastructure as Code (IaC), Continuous Integration/Continuous Deployment (CI/CD) pipelines, and deployment processes.
Support vulnerability management initiatives.
Participate in a 24/7 on-call rotation (approximately once a week) to resolve critical incidents quickly.
Collaborate with DevOps, PRE, development, and security teams to enhance reliability and design resilient systems.
Maintain detailed operational runbooks, incident reports, and system documentation.

