[Day 180] From Kaggle to BigQuery dimension tables - an end2end pipeline

 Hello :)
Today is Day 180!


A quick summary of today:
  • finished data modelling in dbt
  • set up PROD in dbt
  • set up automatic dbt job runs in mage
  • created an end to end pipeline


All code from today is on my github repo.


1. Settling on a data model in dbt

I went over a few different options today, but ended up with the one above. Because all my data comes from a single source, to avoid redundancy I decided to have dim_loans as the main table, with dim_borrower holding just the borrower information and dim_date just the loan issue date.
I also added field descriptions and some tests.

The pics below are taken from the dbt-generated documentation:
  • dim_borrower
  • dim_date
  • dim_loans (image is truncated as there are many fields)

2. Setting up a PROD environment for dbt

After I finally settled on a data modelling architecture, I created a PROD env to run all the models in a job.
Not seen in the pic, but there is an 'API trigger' button that shows how job runs can be triggered from another pipeline orchestrator. This gave me the idea to do exactly that from mage.

3. Automatic dbt build trigger in mage

It is just a one-block pipeline.

But it took me some time to figure out an issue: I kept getting a 401 Unauthorized error whenever I used mage's dbt trigger integration. It turns out that functionality only works when mage is run from their docker image, whereas I am running mage installed with pip. I checked that my API link actually works using Postman, and then just switched to Python's requests library to send the trigger request to dbt.
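For reference, here is a minimal sketch of that trigger request, assuming dbt Cloud's v2 "trigger job run" endpoint; the account ID, job ID, and token are placeholders read from environment variables rather than my actual values.

```python
import os
import requests

# Placeholders - the real values come from the dbt Cloud job's API trigger page
DBT_ACCOUNT_ID = os.environ["DBT_ACCOUNT_ID"]
DBT_JOB_ID = os.environ["DBT_JOB_ID"]
DBT_API_TOKEN = os.environ["DBT_API_TOKEN"]


def trigger_dbt_job() -> dict:
    """POST to dbt Cloud's 'trigger job run' endpoint and return the response JSON."""
    url = f"https://cloud.getdbt.com/api/v2/accounts/{DBT_ACCOUNT_ID}/jobs/{DBT_JOB_ID}/run/"
    response = requests.post(
        url,
        headers={"Authorization": f"Token {DBT_API_TOKEN}"},
        json={"cause": "Triggered from mage pipeline"},
        timeout=30,
    )
    response.raise_for_status()  # a 401 here means the token or IDs are wrong
    return response.json()


if __name__ == "__main__":
    print(trigger_dbt_job())
```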

4. An end to end pipeline in mage

At this point, I had 3 pipelines in mage: 

  • data_raw_to_gcs - which takes data from the Kaggle API, does small checks, and loads it to GCS (a rough sketch of this step follows the list)
  • gcs_to_bq - SQL queries to load data from GCS to BigQuery
  • run_dbt_job - the pipeline described above
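
As an illustration only, this is roughly what the loader step of data_raw_to_gcs could look like, assuming the kaggle and google-cloud-storage packages; the dataset name, bucket, and file paths are placeholders, not my actual project values.

```python
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi
from google.cloud import storage

# Placeholders - swap in the real dataset, bucket, and paths
KAGGLE_DATASET = "some-user/some-loan-dataset"
LOCAL_FILE = "data/loans.csv"
GCS_BUCKET = "my-loans-bucket"
GCS_BLOB = "raw/loans.csv"


def download_from_kaggle() -> pd.DataFrame:
    """Download the dataset via the Kaggle API and read it into a DataFrame."""
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(KAGGLE_DATASET, path="data", unzip=True)
    return pd.read_csv(LOCAL_FILE)


def small_checks(df: pd.DataFrame) -> pd.DataFrame:
    """A couple of sanity checks before anything is uploaded."""
    assert not df.empty, "downloaded dataset is empty"
    assert df.columns.is_unique, "duplicate column names found"
    return df


def upload_to_gcs(local_path: str) -> None:
    """Upload the checked file to the raw layer of the GCS bucket."""
    bucket = storage.Client().bucket(GCS_BUCKET)
    bucket.blob(GCS_BLOB).upload_from_filename(local_path)


if __name__ == "__main__":
    small_checks(download_from_kaggle())
    upload_to_gcs(LOCAL_FILE)
```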
So now I wanted to make an end to end pipeline that executes all 3 of the above, and mage has a nice way of doing that too.
I created this end_to_end one-block pipeline.
It executes all three of the above - from downloading the Kaggle data to uploading the final dim_loans, dim_borrower, and dim_date to BigQuery.
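
The one-block orchestrator could look roughly like the sketch below, assuming mage's trigger_pipeline helper from mage_ai.orchestration.triggers.api is available (the exact keyword arguments may differ between mage versions):

```python
from mage_ai.orchestration.triggers.api import trigger_pipeline

# The three pipelines described above, executed one after another
PIPELINE_UUIDS = ["data_raw_to_gcs", "gcs_to_bq", "run_dbt_job"]


def run_end_to_end() -> None:
    """Trigger each pipeline in order, waiting for one to finish before starting the next."""
    for uuid in PIPELINE_UUIDS:
        trigger_pipeline(
            uuid,
            check_status=True,      # poll until the triggered run completes
            error_on_failure=True,  # stop the chain if a run fails
            verbose=True,
        )


if __name__ == "__main__":
    run_end_to_end()
```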


5. A small thing 

I updated the Makefile help options and descriptions again, to make them a bit more user-friendly.


Tomorrow, I will start working on a dashboard in Looker. ^^

That is all for today!

See you tomorrow :)
