
Showing posts from June 27, 2024

[Day 178] Starting 'Lending club data engineering project'

Hello :) Today is Day 178!

A quick summary of today:

- started my own data engineering project
- GCP as the cloud storage solution
- Terraform for infrastructure management
- Mage for orchestration
- dbt for modelling
- maybe more to come

The last bit of Module 4 was about creating a Spark cluster in GCP using Dataproc. We create a cluster, which in turn creates a VM for it. On that cluster we can add PySpark jobs (or other job types, such as Spark) and submit the Python scripts we wrote with PySpark, so they run in the cloud. And because the cluster is created on GCP, it can read data directly from GCS and write data back to it.

Now ... onto the Lending club data pipeline project

It is still early stages, but I have an idea in mind for how to combine the different tools I learned from the DataTalksClub data engineering camp. I am taking inspiration from the data engineering zoomcamp, where full documentation is encouraged, so I will try to create nice graphs, visualisations and explanations when