
Data Engineer with PySpark Expertise

gsstech-group · Dubai, United Arab Emirates
On-site · Full-time

Experience Level

Mid to Senior

Qualifications

Required Skills & Qualifications

  • 5+ years of experience in Data Engineering.
  • Proven expertise in PySpark.
  • Hands-on experience with the Cloudera Data Platform (CDP).
  • In-depth knowledge of the Hadoop ecosystem (HDFS, Hive, Impala, YARN).
  • Strong proficiency in SQL and data modeling principles.
  • Familiarity with workflow orchestration tools (e.g., Apache Airflow, Oozie).
  • Solid understanding of data warehousing concepts.
  • Experience in performance tuning and optimization.

Good to Have

  • Experience with cloud platforms such as AWS, Azure, or GCP.
  • Familiarity with streaming technologies (e.g., Kafka, Spark Streaming).
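To illustrate the kind of SQL and data-modeling work the role calls for, here is a minimal star-schema sketch using SQLite; the table and column names are hypothetical, not taken from the posting:

```python
import sqlite3

# Minimal star schema: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES (10, 1, 12.5), (11, 1, 7.5), (12, 2, 30.0);
""")

# Typical warehouse query: aggregate facts grouped by a dimension attribute.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
conn.close()
```

The fact/dimension split shown here is the core of the "data warehousing concepts" the posting asks for: facts hold measurable events, dimensions hold descriptive attributes joined in at query time.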

About the job

Join gsstech-group as a talented Data Engineer specializing in PySpark and the Cloudera Data Platform (CDP). In this pivotal role, you will be responsible for designing, developing, and maintaining high-quality data pipelines that are both scalable and efficient. Your contributions will ensure optimal data performance and availability across our organization.

This position requires extensive hands-on experience in big data technologies, cloud-native environments, and advanced data processing frameworks. You will collaborate with diverse teams to create reliable data solutions that facilitate actionable business insights.

Key Responsibilities

1. Data Pipeline Development

  • Design and maintain scalable ETL/ELT data pipelines using PySpark on CDP.
  • Ensure data integrity, reliability, and performance optimization.

2. Data Ingestion

  • Build ingestion frameworks to gather data from various sources including relational databases, APIs, streaming platforms, and file systems.
  • Load both structured and unstructured data into Data Lake and Data Warehouse environments.

3. Data Transformation & Processing

  • Process, cleanse, and transform extensive datasets using PySpark.
  • Create reusable data processing components.

4. Performance Optimization

  • Optimize Spark jobs and Cloudera components for peak performance.
  • Enhance memory, partitioning, and execution strategies.
  • Reduce ETL runtime and improve cluster efficiency.

5. Data Quality & Validation

  • Establish data validation checks and monitoring systems.
  • Maintain comprehensive data quality and governance standards.
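As a plain-Python illustration of the validation checks mentioned above, independent of any particular framework; the rules and field names are hypothetical:

```python
# Simple rule-based data validation: each check returns the failing records.

def check_not_null(records, field):
    """Records where a required field is missing."""
    return [r for r in records if r.get(field) is None]

def check_range(records, field, lo, hi):
    """Records where a numeric field falls outside [lo, hi]."""
    return [r for r in records
            if r.get(field) is not None and not lo <= r[field] <= hi]

records = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -7},
]

# A monitoring system would alert whenever any failure list is non-empty.
failures = {
    "amount_not_null": check_not_null(records, "amount"),
    "amount_in_range": check_range(records, "amount", 0, 1000),
}
```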

6. Automation & Orchestration

  • Automate workflows using Apache Oozie, Apache Airflow, or similar orchestration tools.
  • Support CI/CD integration for data pipelines.
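The orchestration tools named above all reduce to one idea: running tasks in dependency order. A standard-library sketch of that idea using Python's graphlib rather than Airflow itself (the task names are invented):

```python
from graphlib import TopologicalSorter

# Pipeline tasks and their upstream dependencies, as an Airflow DAG would model them.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is why the posting pairs it with CI/CD integration.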

7. Monitoring & Support

  • Oversee pipeline health and troubleshoot any issues that arise.
  • Provide ongoing production support and drive continuous improvement initiatives.

About gsstech-group

gsstech-group is a leading technology firm committed to innovation and excellence in data solutions. Our team is dedicated to harnessing the power of data to drive business transformation and growth.
