Posts

Showing posts from June 26, 2024

[Day 177] Spark for batch processing

Hello :) Today is Day 177!

A quick summary of today: started Module 5 (batch processing) of the data eng zoomcamp, covering Spark, Spark operations, and connecting PySpark to GCS.

What is batch processing of data? It is a method of executing data processing tasks on a large volume of collected data all at once, rather than in real time. It is often done at scheduled times (e.g. hourly, daily, weekly, x times per hour, or every x minutes) or when sufficient data has accumulated.

Technologies used for batch processing: Python scripts, SQL, Spark, Flink.

Advantages of batch jobs: easy to manage, easy to retry, easy to scale. Disadvantages: a delay in getting fresh data.

Spark is one of the most popular batch processing tools, and its Python API, PySpark, is widely used. I used PySpark during my placement at Lloyds Banking Group back in 2019-2020, so there was not much new info~ The most important bit is that Spark works with clusters, and in order to utilise Spark's power in handling large datasets, we need to partition our data.

Then, I got a refres