About the Role
We are seeking a highly skilled Senior Automation and Tools Engineer to enhance our platform's scalability, reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to design, automate, and sustain the infrastructure that drives our core platform, which includes data pipelines, machine learning workloads, and real-time analytics systems.
This is a hands-on, impactful position with visibility across the tech stack, providing a unique opportunity to influence the future of our infrastructure and operations.
Key Responsibilities
- Collaborate closely with Site Reliability Engineers (SREs) and developers to automate and manage datacenter and cloud-based infrastructure, focusing on build and deployment systems, as well as configuration management.
- Enhance system reliability and performance through automation, observability, and proactive capacity planning.
- Develop and configure both standard and custom tools for datacenter provisioning, configuration management, and observability.
- Improve the developer experience by providing self-service tools.
- Implement and maintain monitoring, alerting, and incident response processes, including Service Level Objectives (SLOs), runbooks, and on-call rotations.
- Foster collaboration across engineering and data science teams to cultivate a culture of performance and reliability.
- Maintain security, compliance, and operational readiness across our physical and cloud infrastructure.
- Lead post-incident reviews and continuous improvement initiatives.

