[Day 181] Lending Club data engineering project - Done
Hello :)
Today is Day 181!
A quick summary of today: I completed and documented my data engineering project.
Everything is on my GitHub repo, but below I will provide an overview.
A diagram overview of the tech used:
- Raw Lending Club data from Kaggle
- Mage is used to orchestrate an end-to-end process that includes:
  - extracting the data using Kaggle's API and loading it to Google Cloud Storage (used as a data lake)
  - creating tables in BigQuery (used as a data warehouse)
  - running dbt transformation jobs
- Terraform is used to manage and provision the infrastructure needed for the data pipeline on Google Cloud Platform
- dbt is used to transform the data into dimension tables, add data tests, and create data documentation
- Looker is used to create a visualisation dashboard
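To make the order of the orchestrated steps concrete, here is a minimal sketch in plain Python. It is not Mage's API and not the code from my repo: the step functions, the context keys (`bucket`, `file_name`, `project`, `dataset`) and the `raw_loans` table name are all hypothetical placeholders, and each step only records what a real block would produce instead of calling Kaggle, GCS, or BigQuery.

```python
from typing import Callable, Dict, List, Tuple

# A step takes the shared context dict and returns an updated copy.
Step = Tuple[str, Callable[[Dict], Dict]]


def extract_to_lake(ctx: Dict) -> Dict:
    # Placeholder: a real step would download from Kaggle and upload to GCS.
    return {**ctx, "lake_uri": f"gs://{ctx['bucket']}/raw/{ctx['file_name']}"}


def create_warehouse_table(ctx: Dict) -> Dict:
    # Placeholder: a real step would create a BigQuery table from the lake file.
    return {**ctx, "table_id": f"{ctx['project']}.{ctx['dataset']}.raw_loans"}


def run_dbt(ctx: Dict) -> Dict:
    # Placeholder: a real step would invoke dbt; here we only record the command.
    return {**ctx, "dbt_command": ["dbt", "build"]}


STEPS: List[Step] = [
    ("extract_to_lake", extract_to_lake),
    ("create_warehouse_table", create_warehouse_table),
    ("run_dbt", run_dbt),
]


def run_pipeline(steps: List[Step], context: Dict) -> Dict:
    """Run the steps in order; an exception in any step stops the run."""
    ctx = dict(context)
    for name, step in steps:
        ctx = step(ctx)
        ctx.setdefault("completed", []).append(name)
    return ctx


if __name__ == "__main__":
    result = run_pipeline(
        STEPS,
        {"bucket": "my-bucket", "file_name": "loans.csv",
         "project": "my-project", "dataset": "lending_club"},
    )
    print(result["completed"])
```

The point of the sketch is only the ordering: each step depends on the output of the previous one, which is the part the orchestrator takes care of.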
For the dbt documentation, I was using the dbt Cloud IDE for development, but to deploy the docs I needed to get their files, so the easiest way was to sync the project and run dbt locally. Setting up dbt to sync with local files was not hard, and this gave me easy access to the 'target' folder, which contains the files for the auto-generated documentation. Once I got this folder, I 'dropped' it into Netlify and boom - the dbt docs are now available for all.
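I uploaded the whole 'target' folder, but as a rough sketch of a tidier alternative, the snippet below copies only the three files that `dbt docs generate` writes for the static site (`index.html`, `manifest.json`, `catalog.json`) into a separate folder. The function name `collect_dbt_docs` and the `site` output folder are my own hypothetical choices, not something dbt or Netlify provides.

```python
import shutil
from pathlib import Path

# Files written to target/ by `dbt docs generate` that the docs site reads.
DOCS_FILES = ("index.html", "manifest.json", "catalog.json")


def collect_dbt_docs(target_dir, out_dir):
    """Copy the dbt docs files from target_dir into out_dir and return the new paths."""
    target = Path(target_dir)
    out = Path(out_dir)

    # Fail early if the docs have not been generated yet.
    missing = [name for name in DOCS_FILES if not (target / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {target}: {', '.join(missing)}")

    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in DOCS_FILES:
        shutil.copy2(target / name, out / name)
        copied.append(out / name)
    return copied
```

Calling `collect_dbt_docs("target", "site")` from the dbt project root after `dbt docs generate` would leave a small `site` folder that can be uploaded to a static host.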
Today I completed the visualisation dashboard in Looker, which is the last official bit from the project.
Here is a link to Looker. It was a learning curve... one thing I struggled with was histograms. They do not seem to be a native chart type in Looker, so I had some bad-looking bar charts for a bit. Then I saw that I could create bins for numerical variables, and that is how the middle and bottom-right charts became a reality. Overall, it was pretty smooth, but Looker can be kind of slow if I make too many changes quickly.
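The binning idea itself is simple, and it can be sketched outside Looker in a few lines of Python: put each value into a fixed-width bin, count the values per bin, and plot the counts as a regular bar chart. The function name, the bin width of 5 and the sample interest rates below are made up for illustration and are not taken from the dashboard.

```python
from collections import Counter


def histogram_counts(values, width):
    """Count values per fixed-width bin and return (label, count) pairs in bin order."""
    counts = Counter()
    for value in values:
        # Lower edge of the bin that contains this value, e.g. 11.2 -> 10 for width 5.
        lower = (value // width) * width
        counts[lower] += 1
    return [
        (f"{lower:g}-{lower + width:g}", n)
        for lower, n in sorted(counts.items())
    ]


# Hypothetical interest rates in percent.
rates = [5.3, 7.9, 11.2, 12.5, 13.1, 18.4]
for label, n in histogram_counts(rates, 5):
    print(label, n)
# prints:
# 5-10 2
# 10-15 3
# 15-20 1
```

Each label then becomes one category on the bar chart's axis, which is what makes a plain bar chart look like a histogram.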
Overall thoughts ~
I learned about all of these technologies from DataTalksClub's Data Engineering Zoomcamp. The value I got from this *free* course is astonishing. Thank you so much to the organisers. I learned about data modelling, the cloud, infrastructure as code, data pipelines, orchestration, automation, batch, streaming, and of course the exact tools from the first picture.
I am currently doing their LLM Zoomcamp and MLOps Zoomcamp, so I hope to have a project from those too in the near future.
That is all for today!
See you tomorrow :)