About the job
Join the innovative team at Novibet as a Site Reliability Engineer!
Are you prepared to play a pivotal role in a vibrant, rapidly expanding company? If you possess a strong commitment to reliability and scalability and thrive in a high-energy environment, this is the perfect opportunity for you.
About Us
Established in 2010, Novibet is a leading GameTech company with a presence across Europe, the Americas, and other regions, including Greece, Brazil, Ireland, Finland, Mexico, Chile, Ecuador, Cyprus, and New Zealand. With operational hubs in Greece, Malta, Brazil, and Mexico, our workforce exceeds 1,200 employees globally. We are dedicated to leveraging cutting-edge technology to provide seamless entertainment and gaming experiences to our ever-growing customer base.
Why Work with Novibet?
At Novibet, we empower our employees to excel by fostering a culture of growth through continuous learning and adaptation. Our team of forward-thinkers is dedicated to tackling new challenges while maintaining a positive, inclusive, and supportive workplace where everyone can flourish.
Join our global team of over 1,200 dedicated professionals who value collaboration, innovation, and personal development.
Your Responsibilities
- Act as the primary on-call responder for platform incidents.
- Lead triage efforts, collaborate with DevOps and Engineering teams, and manage resolution processes from start to finish.
- Conduct comprehensive post-incident reviews to address root causes, not just close tickets.
- Manage the complete observability stack, including alerting, dashboards, log aggregation, and distributed tracing.
- Establish and uphold standards for defining service health across critical services.
- Refine alert thresholds to minimize noise and enhance signal quality.
- Systematically work to reduce the frequency and impact of future incidents.
- Analyze trends, monitor SLI/SLO performance, and pinpoint chronic issues.
- Drive enhancements in platform configuration, architecture, and operational practices.
- Identify and eliminate manual and repetitive operational tasks.
- Create runbooks, self-healing scripts, and tools for automated remediation.
- Strengthen organizational resilience, not just team efficiency.
- Establish and monitor Service Level Indicators and Objectives in collaboration with Service Managers.
- Serve as the technical advocate during reliability discussions.
- Collaborate with Engineers to embed reliability into releases, capacity planning, and load testing.
- Work with DevOps on strengthening infrastructure.
- Participate in release planning as a reliability checkpoint before critical deployments.

