Posts

Showing posts from June 29, 2024

[Day 180] From Kaggle to BigQuery dimension tables - an end2end pipeline

Hello :) Today is Day 180!

A quick summary of today:
- finished data modelling in dbt
- set up PROD in dbt
- set up automatic dbt job runs in mage
- created an end-to-end pipeline

All code from today is on my GitHub repo.

1. Settling down on a data model in dbt

I went over a few different models today, but I ended up with the one above. Because all my data comes from one source, I decided, in order to avoid redundancy, to have dim_loans as the main table, plus dim_borrower, which holds only information about the borrower, and dim_date, which holds only the loan issue date. I also added data description fields and some tests. The pictures below are taken from the dbt-generated documentation:

dim_borrower
dim_date
dim_loans (image is truncated as there are many fields)

2. Setting up a PROD environment for dbt

After I finally settled on a data modelling architecture, I created a PROD env to run all the models in a job. Not seen in the pic, but there is an 'API trigger' button which
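As a rough illustration of the split described in section 1 of the Day 180 post (one source table feeding dim_loans, dim_borrower and dim_date), here is a minimal sketch using Python's built-in sqlite3 module instead of dbt and BigQuery. The staging table name stg_loans, every column name, and the sample rows are assumptions made up for the example; they are not the schema from the post. The is_unique_not_null helper is only a loose analogue of dbt's unique and not_null tests.

```python
import sqlite3

# Hypothetical source rows; these columns are assumptions, not the post's real schema.
SOURCE_ROWS = [
    ("L001", "B001", 10000.0, "2018-03-01", "RENT", 55000.0),
    ("L002", "B002", 5000.0, "2018-03-01", "OWN", 72000.0),
    ("L003", "B003", 12000.0, "2018-04-01", "MORTGAGE", 91000.0),
]


def build_dimensions(conn):
    """Load one staging table, then derive the three dimension tables from it."""
    conn.execute(
        "CREATE TABLE stg_loans ("
        "loan_id TEXT, borrower_id TEXT, loan_amount REAL, "
        "issue_date TEXT, home_ownership TEXT, annual_income REAL)"
    )
    conn.executemany("INSERT INTO stg_loans VALUES (?, ?, ?, ?, ?, ?)", SOURCE_ROWS)
    conn.executescript(
        """
        -- Borrower attributes only, one row per borrower.
        CREATE TABLE dim_borrower AS
        SELECT DISTINCT borrower_id, home_ownership, annual_income
        FROM stg_loans;

        -- One row per distinct loan issue date, with derived year and month.
        CREATE TABLE dim_date AS
        SELECT DISTINCT issue_date,
               CAST(strftime('%Y', issue_date) AS INTEGER) AS year,
               CAST(strftime('%m', issue_date) AS INTEGER) AS month
        FROM stg_loans;

        -- Main table: loan-level fields plus keys pointing at the other two.
        CREATE TABLE dim_loans AS
        SELECT loan_id, borrower_id, issue_date, loan_amount
        FROM stg_loans;
        """
    )


def count_rows(conn, table):
    """Row count for a table; the name is interpolated, so pass trusted names only."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def is_unique_not_null(conn, table, column):
    """True when every row has a non-null, distinct value in the column."""
    total, distinct_non_null = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return total == distinct_non_null


conn = sqlite3.connect(":memory:")
build_dimensions(conn)
for table in ("dim_loans", "dim_borrower", "dim_date"):
    print(table, count_rows(conn, table))
```

In this sketch dim_date ends up with fewer rows than dim_loans because two of the sample loans share an issue date, which is the kind of redundancy the split is meant to remove.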

[Day 179] Using Docker, Makefile, and starting Data modelling for my Lending club project

Hello :) Today is Day 179!

A quick summary of today:
- continued working on my Lending club data engineering project

Today I added a few cool features, and I learned plenty.

1. Introduced docker to the project

Here is the Dockerfile I created

Today I learned more about the docker folder structure: where to copy what, and where things live. At first I was not sure where things go, which directory my env vars should point to, and where I should copy files. But then I found that in Docker Desktop I can view the files in a running container, so that is how I figured out what goes where.

The bash script referenced is here:

And my docker-compose.yml (before I added the volumes, the code I was writing in mage was not persisting, so now I know what happens without volumes)

2. Added a Makefile

I also made a Makefile (using this for the first time). I saw in the data eng zoomcamp project advice that adding a Makefile is a good idea, and it is good for reproducibility. These are the options that can be