About the job
Job Title: Data Engineer (PySpark)
________________________________________
About the Role
We are seeking a skilled Data Engineer with strong expertise in PySpark and the Cloudera Data Platform (CDP) to join our data engineering team. In this role, you will design, develop, and maintain scalable data pipelines that ensure high data quality and availability across the organization. A strong background in big data ecosystems, cloud-native tools, and advanced data processing techniques is essential.
The ideal candidate will have hands-on experience with data ingestion, transformation, and optimization on the Cloudera Data Platform, along with a proven track record of applying data engineering best practices. You will collaborate closely with fellow data engineers to build solutions that deliver actionable business insights.
Responsibilities
- Design, develop, and maintain highly scalable, optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity and accuracy.
- Manage data ingestion from diverse sources (e.g., relational databases, APIs, file systems) into the data lake or data warehouse on CDP.
- Use PySpark to process, cleanse, and transform large datasets into actionable formats that meet analytical needs and business objectives.
- Tune PySpark code and Cloudera components to improve resource utilization and reduce ETL runtimes.
- Establish data quality checks, monitoring, and validation protocols to uphold data accuracy and reliability throughout the pipeline.
- Automate data workflows using orchestration tools such as Apache Oozie or Apache Airflow within the Cloudera ecosystem.
- Monitor pipeline performance, troubleshoot issues, and perform routine maintenance on the Cloudera Data Platform and related data processes.
- Collaborate closely with data engineers, analysts, product managers, and other stakeholders to understand data requirements and support data-driven initiatives.
- Document data engineering processes, code, and pipeline configurations thoroughly.

