Job Description
- In-depth understanding of Hadoop and Spark architecture and of RDD transformations.
- Proven experience developing solutions with the Spark architecture and PySpark for data engineering pipelines that transform and aggregate data from a variety of sources into the data lake.
- At least 3 years of relevant experience developing PySpark programs using the Spark APIs. Expertise in file formats such as Parquet and ORC.
- Experience troubleshooting and fine-tuning Spark and Python-based applications for scalability and performance.
- Experience designing Hive tables to handle high-velocity, high-variety, and high-volume data.
- Experience ingesting, processing, and analyzing data from disparate sources using Spark/SQL.
- Knowledge of spark-submit and the Spark UI. Experience creating Spark RDDs and performing operations on them (see the first sketch after this list).
- Experience creating Spark DataFrames from RDDs, Hive tables, and Parquet files, and performing joins and aggregations on those DataFrames (see the second sketch after this list).
- Experience processing data with Python and other API modules.
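As a minimal sketch of the RDD work referenced above, the snippet below creates an RDD, applies a filter and a reduceByKey transformation, and collects the result. The application name and the sample data are placeholders, not part of the role's actual codebase.

```python
from pyspark.sql import SparkSession

# Entry point for a PySpark application; the app name is a placeholder.
spark = SparkSession.builder.appName("rdd-transformation-sketch").getOrCreate()
sc = spark.sparkContext

# Create a pair RDD from an in-memory collection and apply common transformations.
events = sc.parallelize([("web", 3), ("mobile", 5), ("web", 2)])
totals = (events
          .filter(lambda kv: kv[1] > 0)      # narrow transformation
          .reduceByKey(lambda a, b: a + b))  # wide transformation (triggers a shuffle)

print(totals.collect())  # e.g. [('web', 5), ('mobile', 5)]
spark.stop()
```

A script like this would typically be launched with spark-submit (for example, `spark-submit --master yarn rdd_sketch.py`), and its jobs and stages inspected in the Spark UI.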
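The second sketch illustrates DataFrame creation from an RDD, a Hive table, and a Parquet file, followed by a join and an aggregation. The paths, database, table, and column names are hypothetical and would differ in a real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() is required to read Hive tables; the app name is a placeholder.
spark = (SparkSession.builder
         .appName("dataframe-join-agg-sketch")
         .enableHiveSupport()
         .getOrCreate())

# DataFrame from a Parquet file and from a Hive table (both names are placeholders).
orders = spark.read.parquet("/data/lake/orders.parquet")
customers = spark.table("sales_db.customers")

# A DataFrame can also be built from an RDD by supplying column names.
rdd = spark.sparkContext.parallelize([(1, "A"), (2, "B")])
lookup = spark.createDataFrame(rdd, ["customer_id", "segment_code"])

# Join on a shared key, then aggregate order amounts per customer segment.
result = (orders.join(customers, on="customer_id", how="inner")
          .groupBy("segment")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("order_count")))

# Write the aggregated output back to the data lake as Parquet.
result.write.mode("overwrite").parquet("/data/lake/segment_totals")

spark.stop()
```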