About the job
DataHub is an AI & Data Context Platform used by over 3,000 enterprises, including Apple, CVS Health, Netflix, and Visa. Developed in collaboration with an open-source community of more than 13,000 members, DataHub's metadata graph provides deep insight into AI and data assets and is built for scalability and extensibility.
Our flagship enterprise SaaS offering, DataHub Cloud, is a fully managed solution with AI-driven discovery, observability, and governance tools. Organizations use DataHub to get more value from their data investments, keep their AI systems reliable, and establish unified governance across their data.
About the Role
We are looking for a seasoned Site Reliability Engineering (SRE) Tech Lead to join our team at DataHub. In this role, you will lead initiatives that improve the reliability, scalability, and operational excellence of our platform offerings. You will oversee technical projects across DataHub Cloud and our evolving enterprise deployment solutions, giving customers greater control and flexibility in running DataHub in their preferred environments.
Key Responsibilities
Technical Leadership & Architecture
- Design and develop robust, scalable infrastructure solutions for DataHub Cloud and enterprise deployments.
- Lead the technical vision for multi-cloud deployment strategies and distributed system integrations.
- Architect monitoring, observability, and alerting systems across various environments.
- Promote best practices for infrastructure as code, configuration management, and deployment automation.
Enterprise Platform Development
- Collaborate with product and engineering teams to shape the development of advanced deployment capabilities.
- Work alongside cross-functional teams to create systems for seamless installation, upgrades, and rollback processes across diverse environments.
- Contribute to the design and implementation of comprehensive monitoring and health check systems for distributed deployments.
- Collaborate with engineering teams to develop self-healing and automated remediation capabilities.
Platform Reliability & Operations
- Establish and uphold SLAs/SLOs for both cloud and enterprise offerings.
- Lead incident response and conduct post-mortem analyses to drive continuous improvement.
- Implement chaos engineering practices to enhance system resilience and reliability.

