About Us
Articul8 AI is a leader in Generative AI innovation, offering state-of-the-art SaaS products that redefine business operations. Our platform enables organizations to harness the power of artificial intelligence in a dependable, scalable, and secure environment.
Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of our GenAI SaaS platform. In this role, you will bridge development and operations, using automation and established best practices to meet our service reliability goals while supporting rapid innovation.
Key Responsibilities
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, minimizing toil and enhancing efficiency.
- Define Service Level Indicators (SLIs), set Service Level Objectives (SLOs) against them, and measure and improve performance relative to those targets.
- Participate in on-call rotations and respond rapidly to production incidents to reduce downtime and user impact.
- Collaborate closely with development teams to create reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct detailed post-mortems, and promote continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, particularly for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

