About the job
Join gsstech-group as a talented Data Engineer specializing in PySpark and the Cloudera Data Platform (CDP). In this pivotal role, you will design, develop, and maintain high-quality data pipelines that are scalable and efficient, ensuring optimal data performance and availability across the organization.
This position requires extensive hands-on experience in big data technologies, cloud-native environments, and advanced data processing frameworks. You will collaborate with diverse teams to create reliable data solutions that facilitate actionable business insights.
Key Responsibilities
1. Data Pipeline Development
- Design and maintain scalable ETL/ELT data pipelines using PySpark on CDP.
- Ensure data integrity, reliability, and optimized performance.
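As a rough illustration of the ETL/ELT shape this responsibility describes, here is a minimal extract-transform-load sketch in plain Python; in the role itself these stages would be expressed with PySpark DataFrame reads, transformations, and writes on CDP. All source, sink, and field names here are hypothetical.

```python
# Minimal ETL sketch: the three stages a PySpark pipeline would express
# with spark.read.*, DataFrame transformations, and df.write.*.
# All record shapes and field names are illustrative assumptions.

def extract(source_rows):
    """Stand-in for spark.read.*: pull raw records from a source."""
    return list(source_rows)

def transform(rows):
    """Cleanse and reshape: drop incomplete records, normalize fields."""
    out = []
    for r in rows:
        if r.get("id") is None:  # integrity check: require a key
            continue
        out.append({"id": r["id"], "amount": round(float(r.get("amount", 0)), 2)})
    return out

def load(rows, sink):
    """Stand-in for df.write.*: append records to a sink."""
    sink.extend(rows)
    return len(rows)

# Usage: run the pipeline end to end over an in-memory "source".
source = [{"id": 1, "amount": "10.5"}, {"id": None, "amount": "3"}, {"id": 2}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
# The record without an id is filtered out, so 2 rows are loaded.
```

The same filter/normalize/append structure maps directly onto DataFrame `filter`, `withColumn`, and `write` calls in PySpark.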
2. Data Ingestion
- Build ingestion frameworks to gather data from various sources including relational databases, APIs, streaming platforms, and file systems.
- Load both structured and unstructured data into Data Lake and Data Warehouse environments.
3. Data Transformation & Processing
- Process, cleanse, and transform large datasets using PySpark.
- Create reusable data processing components.
4. Performance Optimization
- Optimize Spark jobs and Cloudera components for peak performance.
- Enhance memory, partitioning, and execution strategies.
- Reduce ETL runtime and improve cluster efficiency.
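One concrete flavor of the partitioning work described above: a common rule of thumb is to size Spark partitions so each holds on the order of 100-200 MB. The helper below sketches that heuristic; the 128 MiB target is an assumption for illustration, not a CDP or Spark default.

```python
def suggest_partitions(total_bytes, target_bytes=128 * 1024 ** 2, min_parts=1):
    """Suggest a partition count so each partition holds ~target_bytes.

    Rule-of-thumb helper: ceiling of total size / target size. In Spark the
    result would feed e.g. df.repartition(n) or spark.sql.shuffle.partitions.
    """
    return max(min_parts, -(-total_bytes // target_bytes))  # ceiling division

# Usage: a 10 GiB dataset at ~128 MiB per partition.
n = suggest_partitions(10 * 1024 ** 3)  # 80 partitions
```

Tuning then iterates on this starting point using the Spark UI (task skew, spill, shuffle sizes) rather than trusting the heuristic alone.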
5. Data Quality & Validation
- Establish data validation checks and monitoring systems.
- Maintain comprehensive data quality and governance standards.
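The validation checks this area calls for can start as simple rule functions applied to each batch. A minimal sketch follows; thresholds and field names are illustrative, and in practice these checks would run over PySpark DataFrames (e.g. via `filter`/`count`) rather than lists of dicts.

```python
def check_not_null(rows, field, max_null_rate=0.0):
    """Pass only if the share of nulls in `field` stays within max_null_rate."""
    if not rows:
        return True
    nulls = sum(1 for r in rows if r.get(field) is None)
    return nulls / len(rows) <= max_null_rate

def check_unique(rows, field):
    """Pass only if `field` has no duplicate values (a simple key check)."""
    values = [r.get(field) for r in rows]
    return len(values) == len(set(values))

# Usage: gate a batch on quality checks before loading it downstream.
batch = [{"order_id": 1, "ts": "2024-01-01"}, {"order_id": 2, "ts": None}]
ok_keys = check_unique(batch, "order_id")               # ids are distinct
ok_ts = check_not_null(batch, "ts", max_null_rate=0.1)  # 50% nulls: fails
```

Failed checks would typically raise an alert or route the batch to quarantine instead of loading it.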
6. Automation & Orchestration
- Automate workflows using Apache Oozie, Apache Airflow, or similar orchestration tools.
- Support CI/CD integration for data pipelines.
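The core idea behind orchestrators like Oozie and Airflow is executing tasks in dependency order over a DAG. The toy sketch below shows that ordering with Python's standard-library `graphlib`; the task names are hypothetical and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# A tiny DAG: each key lists the tasks it depends on (its upstream tasks),
# similar in spirit to declaring task dependencies in an Airflow DAG.
dag = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["transform"],
    "load": ["validate"],
}

# static_order() yields an execution order that respects every dependency,
# which is what an orchestrator's scheduler computes before running tasks.
order = list(TopologicalSorter(dag).static_order())
```

For a linear chain like this there is exactly one valid order: extract, transform, validate, load.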
7. Monitoring & Support
- Oversee pipeline health and troubleshoot any issues that arise.
- Provide ongoing production support and drive continuous improvement initiatives.

