[Day 180] From Kaggle to BigQuery dimension tables - an end2end pipeline

 Hello :)
Today is Day 180!


A quick summary of today:
  • finished data modelling in dbt
  • set up PROD in dbt
  • set up automatic dbt job runs in mage
  • created an end to end pipeline


All code from today is on my github repo.


1. Settling on a data model in dbt

I went over a few different options today, but ended up with the one above. Because all my data comes from a single source, to avoid redundancy I decided to have dim_loans as the main table, with dim_borrower holding just the borrower information and dim_date just the loan issue date.
I also added field descriptions and some tests.

The pics below are taken from the dbt-generated documentation:
  • dim_borrower
  • dim_date
  • dim_loans (image is truncated as there are many fields)

2. Setting up a PROD environment for dbt

After I finally settled on a data modelling architecture, I created a PROD env to run all the models in a job.
Not seen in the pic, but there is an 'API trigger' button that shows how job runs can be triggered from another pipeline orchestrator. This gave me the idea to do exactly that from mage.

3. Automatic dbt build trigger in mage

It is just a one-block pipeline.

But it took me some time to figure out an issue: I kept getting a 401 Unauthorized error whenever I used mage's dbt trigger integration. It turns out that functionality only works when mage is run from their docker image, whereas I am running mage installed with pip. I checked that my API link actually works using Postman, and then just switched to Python's requests library to send the trigger request to dbt.
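For reference, here is a minimal sketch of that trigger request, assuming dbt Cloud's v2 "trigger job run" endpoint; the account ID, job ID, and token are placeholders read from environment variables rather than my actual values.

```python
import os
import requests

# Placeholders - the real values come from the dbt Cloud job's API trigger page
DBT_ACCOUNT_ID = os.environ["DBT_ACCOUNT_ID"]
DBT_JOB_ID = os.environ["DBT_JOB_ID"]
DBT_API_TOKEN = os.environ["DBT_API_TOKEN"]


def trigger_dbt_job() -> dict:
    """POST to dbt Cloud's 'trigger job run' endpoint and return the response JSON."""
    url = f"https://cloud.getdbt.com/api/v2/accounts/{DBT_ACCOUNT_ID}/jobs/{DBT_JOB_ID}/run/"
    response = requests.post(
        url,
        headers={"Authorization": f"Token {DBT_API_TOKEN}"},
        json={"cause": "Triggered from mage pipeline"},
        timeout=30,
    )
    response.raise_for_status()  # a 401 here means the token or IDs are wrong
    return response.json()


if __name__ == "__main__":
    print(trigger_dbt_job())
```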

4. An end to end pipeline in mage

At this point, I had 3 pipelines in mage: 

  • data_raw_to_gcs - which takes data from the Kaggle API, does small checks, and loads it to GCS (a rough sketch of this step follows the list)
  • gcs_to_bq - SQL queries to load data from GCS to BigQuery
  • run_dbt_job - the pipeline described above
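
As an illustration only, this is roughly what the loader step of data_raw_to_gcs could look like, assuming the kaggle and google-cloud-storage packages; the dataset name, bucket, and file paths are placeholders, not my actual project values.

```python
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi
from google.cloud import storage

# Placeholders - swap in the real dataset, bucket, and paths
KAGGLE_DATASET = "some-user/some-loan-dataset"
LOCAL_FILE = "data/loans.csv"
GCS_BUCKET = "my-loans-bucket"
GCS_BLOB = "raw/loans.csv"


def download_from_kaggle() -> pd.DataFrame:
    """Download the dataset via the Kaggle API and read it into a DataFrame."""
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(KAGGLE_DATASET, path="data", unzip=True)
    return pd.read_csv(LOCAL_FILE)


def small_checks(df: pd.DataFrame) -> pd.DataFrame:
    """A couple of sanity checks before anything is uploaded."""
    assert not df.empty, "downloaded dataset is empty"
    assert df.columns.is_unique, "duplicate column names found"
    return df


def upload_to_gcs(local_path: str) -> None:
    """Upload the checked file to the raw layer of the GCS bucket."""
    bucket = storage.Client().bucket(GCS_BUCKET)
    bucket.blob(GCS_BLOB).upload_from_filename(local_path)


if __name__ == "__main__":
    small_checks(download_from_kaggle())
    upload_to_gcs(LOCAL_FILE)
```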
So now I wanted to make an end to end pipeline that executes all 3 of the above, and mage has a nice way of doing that too.
I created this end_to_end one-block pipeline.
It executes all three of the above - from downloading the Kaggle data to uploading the final dim_loans, dim_borrower, and dim_date to BigQuery.
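
The one-block orchestrator could look roughly like the sketch below, assuming mage's trigger_pipeline helper from mage_ai.orchestration.triggers.api is available (the exact keyword arguments may differ between mage versions):

```python
from mage_ai.orchestration.triggers.api import trigger_pipeline

# The three pipelines described above, executed one after another
PIPELINE_UUIDS = ["data_raw_to_gcs", "gcs_to_bq", "run_dbt_job"]


def run_end_to_end() -> None:
    """Trigger each pipeline in order, waiting for one to finish before starting the next."""
    for uuid in PIPELINE_UUIDS:
        trigger_pipeline(
            uuid,
            check_status=True,      # poll until the triggered run completes
            error_on_failure=True,  # stop the chain if a run fails
            verbose=True,
        )


if __name__ == "__main__":
    run_end_to_end()
```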


5. A small thing 

I updated the Makefile help options and descriptions again, to make them a bit more user-friendly.


Tomorrow, I will start working on a dashboard in Looker. ^^

That is all for today!

See you tomorrow :)
