About the job
Join TWG Group Holdings, LLC ("TWG Global"), where we are at the forefront of innovation and business transformation across multiple industries, including financial services, insurance, technology, media, and sports. We harness the power of data and AI as pivotal assets in our operations. Our AI-driven, cloud-native approach allows us to deliver real-time intelligence and interactive business applications, facilitating informed decision-making for our clients and employees alike.
We are committed to ethical data and AI practices and to compliance with regulatory standards. Our decentralized structure lets each business unit operate independently, supported by a centralized AI Solutions Group. Strategic collaborations with leading data and AI vendors drive significant advancements in marketing, operations, and product development.
In this role, you will work closely with management to propel our data and analytics transformation, enhance productivity, and enable agile, data-driven decisions. By leveraging our partnerships with top tech startups and academic institutions, you will help foster competitive advantages and stimulate enterprise innovation.
At TWG Global, your efforts will directly contribute to our ambitions for sustained growth and remarkable returns, as we aim to deliver unparalleled value and impact across our various business sectors. Our rapidly expanding AI/ML team is dedicated to providing high-impact solutions to financial institutions, insurers, and other regulated enterprises. Supported by seasoned leaders in finance and national security, our team is scaling quickly to meet client demands across North America with robust, secure, and production-ready AI solutions.
Role Overview
We are seeking a Site Reliability Engineer (SRE) to ensure the scalability, stability, and performance of our data platforms and ML infrastructure. You will collaborate closely with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and minimize operational overhead.
Key Responsibilities:
- Develop and maintain infrastructure for real-time and batch ML workloads.
- Implement observability tooling for model performance, including logging, monitoring, and alerting.
- Design and manage CI/CD pipelines for ML and data applications.
- Ensure high availability, disaster recovery, and rollback capabilities for production environments.
- Collaborate with compliance and IT to manage access controls, secrets, and security policies.
- Troubleshoot incidents, lead postmortems, and drive root-cause resolution.
- Coordinate with U.S. and international teams to provide 24/7 coverage across time zones.

