About the job
Key Responsibilities:
- Architect, develop, and maintain efficient, scalable batch and stream data processing infrastructure to support day-to-day machine learning operations, including training, serving, evaluation, and experimentation systems.
- Design and implement foundational data models, data warehouses, and processing pipelines (both real-time and offline) using technologies such as Spark on AWS EMR, Apache Kafka, AWS Athena, Snowflake, Apache Airflow, and Apache Hudi.
- Collaborate closely with machine learning and data science teams to assess their data requirements, influence the data team’s strategic roadmap, and lead the execution of various initiatives.
- Establish a data governance platform to ensure secure and compliant data management, encompassing services for data cataloging, lineage tracking, auditing, data deletion, and masking.
- Develop and manage orchestration platforms built on Temporal and Airflow, enabling other teams to build their own features and workflows.
- Design and enhance platform and data services/APIs to provide data access for diverse stakeholders and customer-facing data products.
