Posts

Showing posts from June, 2024

[Day 181] Lending club data engineering project - Done

Hello :) Today is Day 181! A quick summary of today: completed and documented my data engineering project. Everything is on my GitHub repo, but below I will provide an overview. A diagram overview of the tech used:
- Raw Lending Club data from Kaggle
- Mage is used to orchestrate an end-to-end process including: extracting data with Kaggle's API and loading it into Google Cloud Storage (used as a data lake), creating tables in BigQuery (used as a data warehouse), and running dbt transformation jobs
- Terraform is used to manage and provision the infrastructure needed for the data pipeline on Google Cloud Platform
- dbt is used to transform the data into dimension tables, add data tests, and create data documentation
- Looker is used to create a visualisation dashboard

For the dbt documentation, I was using the dbt Cloud IDE for development, but to deploy the docs I needed to get its files, so the easiest way was to sync and run dbt locally. Setting up dbt to sync with local files was not hard, and this gave…
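As a rough illustration of the extract-and-load step that Mage orchestrates, here is a minimal sketch of a Kaggle-to-GCS loader; the dataset slug, bucket name and object paths are placeholders rather than the exact values from my repo:

```python
# Minimal sketch: download a Kaggle dataset and load it into a GCS bucket.
# The dataset slug, bucket name and paths below are placeholders.
import os

from google.cloud import storage
from kaggle.api.kaggle_api_extended import KaggleApi


def extract_from_kaggle(dataset: str = "wordsforthewise/lending-club",
                        out_dir: str = "data/raw") -> list[str]:
    """Download and unzip a Kaggle dataset, returning the local file paths."""
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(dataset, path=out_dir, unzip=True)
    return [os.path.join(out_dir, f) for f in os.listdir(out_dir)]


def load_to_gcs(files: list[str], bucket_name: str = "lending-club-data-lake") -> None:
    """Upload local files into the GCS bucket used as the data lake."""
    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    for path in files:
        bucket.blob(f"raw/{os.path.basename(path)}").upload_from_filename(path)


if __name__ == "__main__":
    load_to_gcs(extract_from_kaggle())
```

In the actual project these steps run as blocks inside the Mage pipeline rather than as a standalone script.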

[Day 180] From Kaggle to BigQuery dimension tables - an end2end pipeline

Hello :) Today is Day 180! A quick summary of today:
- finished data modelling in dbt
- set up PROD in dbt
- set up automatic dbt job runs in Mage
- created an end-to-end pipeline

All code from today is on my GitHub repo.

1. Settling on a data model in dbt

I went over a few different models today, but I ended up with the one above. Because all my data comes from a single source, to avoid redundancy I decided to have dim_loans as the main table, with dim_borrower holding just information about the borrower and dim_date just about the loan issue date. I also added field descriptions and some tests. The pictures below are taken from the dbt-generated documentation: dim_borrower, dim_date, dim_loans (image is truncated as there are many fields).

2. Setting up a PROD environment for dbt

After I finally settled on a data modelling architecture, I created a PROD environment to run all the models in a job. Not seen in the picture, but there is an 'API trigger' button which…
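To give an idea of how that API trigger can be called from Mage, here is a minimal sketch that starts a dbt Cloud job over its v2 API; the account ID, job ID and token handling are placeholders, not necessarily how my Mage block is written:

```python
# Minimal sketch: trigger a dbt Cloud job run via its API (e.g. from a Mage block).
# Account ID, job ID and token come from placeholder environment variables.
import os

import requests

DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"
ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]
JOB_ID = os.environ["DBT_CLOUD_JOB_ID"]
TOKEN = os.environ["DBT_CLOUD_TOKEN"]


def trigger_dbt_job(cause: str = "Triggered from Mage pipeline") -> int:
    """Kick off the PROD dbt job and return the id of the created run."""
    resp = requests.post(
        f"{DBT_CLOUD_API}/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
        headers={"Authorization": f"Token {TOKEN}"},
        json={"cause": cause},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["id"]


if __name__ == "__main__":
    print("Started dbt Cloud run:", trigger_dbt_job())
```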

[Day 179] Using Docker, Makefile, and starting Data modelling for my Lending club project

Hello :) Today is Day 179! A quick summary of today: continued working on my Lending Club data engineering project. Today I added a few cool features, and I learned plenty.

1. Introduced Docker to the project

Here is the Dockerfile I created. Today I learned more about the Docker folder structure: where to copy what, and where things live. At first I was not sure where things should go, in which directory I should point my env vars, and where I should copy files. But then I found that in Docker Desktop I can view the files in a running container, and that is how I figured out what goes where. The bash script referenced is here: And my docker-compose.yml (before I added the volumes, the code I was writing in Mage was not persisting, so now I know what happens without volumes; see the sketch at the end of this post).

2. Added a Makefile

I also made a Makefile (using one for the first time). I saw that adding a Makefile is recommended in the data engineering zoomcamp project advice, and it is good for reproducibility. These are the options that can be…
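To show what the volumes do, here is a minimal docker-compose sketch for running Mage with the project directory mounted; the image tag, port and paths are assumptions rather than a copy of my actual file:

```yaml
# Minimal sketch of a docker-compose.yml for Mage; names and paths are placeholders.
services:
  mage:
    image: mageai/mageai:latest        # official Mage image (tag is an assumption)
    command: mage start lending_club   # hypothetical project name
    ports:
      - "6789:6789"                    # Mage's default UI port
    env_file:
      - .env                           # e.g. GCP credentials and project settings
    volumes:
      # Without this bind mount, pipeline code written in the Mage UI
      # only exists inside the container and disappears with it.
      - .:/home/src/
```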

[Day 178] Starting 'Lending club data engineering project'

Hello :) Today is Day 178! A quick summary of today: started my own data engineering project
- GCP as the cloud storage solution
- Terraform for infrastructure management
- Mage for orchestration
- dbt for modelling
- maybe more to come

The last bit of Module 5 was about creating a Spark cluster in GCP using Dataproc. We create a cluster, which provisions the VMs for it. On that cluster we can add PySpark (or other types, like Spark) jobs, submitting the Python scripts (that use PySpark) we wrote and running them in the cloud. And because this cluster is created on GCP, it can directly read data from GCS and also write data back to it; a rough sketch of such a job is below.

Now... onto the Lending Club data pipeline project. It is still early days, but I had an idea in mind for how to combine the different tools I learned from the DataTalksClub data engineering camp. I am taking inspiration from the data engineering zoomcamp, where full documentation is encouraged, so I will try to create nice graphs, visualisations and explanations when…
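As a rough sketch, a PySpark script submitted as a Dataproc job can read from and write to GCS directly; the bucket, paths and column names below are assumptions (the columns are taken loosely from the public Lending Club data):

```python
# Hypothetical PySpark script of the kind submitted as a Dataproc job.
# Bucket names, paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lending-club-batch").getOrCreate()

# On Dataproc the GCS connector is available out of the box, so gs:// paths work directly.
df = spark.read.parquet("gs://my-lending-club-lake/raw/")

# Example transformation: average loan amount per grade.
summary = (
    df.select("grade", "loan_amnt")
      .groupBy("grade")
      .avg("loan_amnt")
)

# Write the result back to the data lake.
summary.write.mode("overwrite").parquet("gs://my-lending-club-lake/report/avg_loan_by_grade/")

spark.stop()
```

A script like this would then be submitted to the cluster with something like `gcloud dataproc jobs submit pyspark`.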

[Day 177] Spark for batch processing

Hello :) Today is Day 177! A quick summary of today: started Module 5: batch processing from the data engineering zoomcamp
- Spark
- Spark operations
- Connecting PySpark to GCS

What is batch processing of data? It is a method of executing data processing tasks on a large volume of collected data all at once, rather than in real time. It is often done at scheduled times (e.g. hourly, daily, weekly, x times per hour or minute) or when sufficient data has accumulated.

Technologies used for batch processing: Python scripts, SQL, Spark, Flink.
Advantages of batch jobs: easy to manage, retry, and scale.
Disadvantages: delay in getting fresh data.

Spark is the most popular batch processing tool, and its Python API, PySpark, is widely used. I used PySpark during my placement at Lloyds Banking Group back in 2019-2020, so there was not much new info~ The most important bit is that Spark works with clusters, and in order to utilise Spark's power in handling large datasets, we need to partition our data (a small sketch is below). Then, I got a refres…
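For the partitioning point, here is a minimal sketch; the file paths and partition count are only illustrative:

```python
# Minimal sketch of partitioning a dataset with PySpark; paths and the
# partition count are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.read.csv("data/lending_club.csv", header=True, inferSchema=True)

# A single big file would effectively be handled by one executor at a time;
# repartitioning splits the data so the work spreads across the cluster.
df = df.repartition(24)

# Each partition is written out as its own parquet file.
df.write.mode("overwrite").parquet("data/lending_club_partitioned/")
```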

[Day 176] Testing, Documentation, Deployment with dbt and visualisations with Looker

Hello :) Today is Day 176! A quick summary of today: finished Module 4: analytics engineering and using dbt from the data engineering zoomcamp. A preview of what I created in the end. Continuing from yesterday with dbt~

First I learned about testing and documenting dbt models. We need to make sure the data we deliver to the end user is correct, so how do we make sure that we are not building our dbt models on top of incorrect data? dbt tests the assumptions that we make about our data:
- tests in dbt are essentially a select SQL query
- these assumptions get compiled to SQL that returns the number of failing records
- tests are defined on a column in the .yml file (a small sketch is at the end of this post)
- dbt provides basic tests to check whether the column values are: unique, not null, accepted values, or a foreign key to another table
- we can create custom tests as queries

Before writing tests, to ensure our data's schema is correct, we can autogenerate it using a package. First, include the package in packages.yml (and run dbt deps to insta…
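As an illustration of those column tests in a .yml file, here is a minimal sketch; the model and column names are loose placeholders based on my dim tables, not the exact schema in the repo:

```yaml
# Hypothetical schema .yml sketch; model and column names are assumptions.
version: 2

models:
  - name: dim_loans
    description: "One row per loan"
    columns:
      - name: loan_id
        description: "Primary key of the loan"
        tests:
          - unique
          - not_null
      - name: grade
        tests:
          - accepted_values:
              values: ["A", "B", "C", "D", "E", "F", "G"]
      - name: borrower_id
        tests:
          # foreign-key style check against the borrower dimension
          - relationships:
              to: ref('dim_borrower')
              field: borrower_id
```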