[Day 175] Learning about and using dbt Cloud

 Hello :)
Today is Day 175!


A quick summary of today:
  • basic SQL in BigQuery (partitioning and clustering, left over from Module 3)
  • starting Module 4: dbt basics - models, sources, seeds, macros, packages, variables


There were some bits from Module 3 that I did not finish: using BigQuery to create partitioned and clustered tables and seeing the impact on how much data queries process.

Using the infamous NYC taxi dataset, I ran some simple queries on creating external tables in BigQuery and saw the effect of partitioning and clustering.

First, create a non-partitioned and a partitioned table in BigQuery.
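A sketch of the statements I mean, assuming an external table over the yellow tripdata files (the table and column names here are mine):

```sql
-- Non-partitioned copy of the external table
CREATE OR REPLACE TABLE nyc_taxi.yellow_tripdata_non_partitioned AS
SELECT * FROM nyc_taxi.external_yellow_tripdata;

-- Same data, partitioned by pickup date
CREATE OR REPLACE TABLE nyc_taxi.yellow_tripdata_partitioned
PARTITION BY DATE(tpep_pickup_datetime) AS
SELECT * FROM nyc_taxi.external_yellow_tripdata;
```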

The dataset is organised by dates, so we partition by one of the datetime columns. Below we can see a significant drop in processed data when running a simple WHERE between two dates on the non-partitioned vs the partitioned table.
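For example, a query pair like this (the dates are illustrative):

```sql
-- Identical queries; the BigQuery UI shows the estimated bytes processed,
-- which is much lower for the partitioned table
SELECT DISTINCT VendorID
FROM nyc_taxi.yellow_tripdata_non_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2020-01-01' AND '2020-01-31';

SELECT DISTINCT VendorID
FROM nyc_taxi.yellow_tripdata_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2020-01-01' AND '2020-01-31';
```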

I faced an issue here: when creating the partitioned table, the table details section in BigQuery said 0 partitions. I was really confused because the data I was partitioning is 10GB, and for other people partitions show up there, but for me it said 0. I kept rerunning and changing/uploading more data.

But it turns out that even though the 'Number of partitions' field says 0, my table was indeed partitioned and the results were clear (as in the pic).

And finally, a table that is both partitioned and clustered; there is a decrease in processed MB of data for the same two queries as well.
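A sketch of the combined version, under the same naming assumptions:

```sql
-- Partitioned by pickup date and clustered by a frequently filtered column
CREATE OR REPLACE TABLE nyc_taxi.yellow_tripdata_partitioned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM nyc_taxi.external_yellow_tripdata;
```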


Now onto Module 4

First I learned about ETL vs ELT

Extract-Transform-Load:
  • slightly more stable and compliant data analysis
  • higher storage and compute costs
Extract-Load-Transform:
  • faster and more flexible data analysis
  • lower cost and lower maintenance
Then - Kimball's dimensional modelling
Objective
  • deliver data understandable to the business users
  • deliver fast query performance
Approach
  • prioritise user understandability and query performance over non-redundant data (3NF)
Other approaches:
  • Bill Inmon
  • Data vault
Elements of Dimensional Modelling
Fact tables
  • measurements, metrics or facts
  • correspond to a business process
  • 'verbs'
Dimension tables
  • correspond to a business entity
  • provide context to a business process
  • 'nouns'
Architecture of Dimensional Modelling
Staging area
  • contains raw data
  • not meant to be exposed to everyone
Processing area
  • from raw data to data models
  • focuses on efficiency
  • ensuring standards
Presentation area
  • final presentation of the data
  • exposure to business stakeholders

What is dbt?

It is a transformation workflow that allows anyone who knows SQL to deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.

Workflow we will use:


Creating a new dbt project

This took some time to figure out (creating the right folders and giving dbt Cloud access to my repo), but in the end:

When creating a new project, there are sample dbt data models (with their lineage).

Anatomy of a dbt model
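In short, a model is just a SELECT statement plus an optional config block, and dbt wraps it in the right DDL. A minimal sketch (the file name and source are hypothetical):

```sql
-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

select *
from {{ source('staging', 'green_tripdata') }}
```

Compiled, this becomes a CREATE OR REPLACE VIEW in the target schema.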

Materialisations in dbt Cloud

  • Ephemeral - temporary; they exist only for the duration of a single dbt run
  • View - virtual tables created by dbt that can be queried like regular tables
  • Table - physical representations of data that are created and stored in the database
  • Incremental - a powerful feature of dbt that allows for efficient updates to existing tables, reducing the need for full data refreshes
In dbt we will select FROM two types of inputs:
Sources

  • the data loaded into our data warehouse that we use as sources for our models
  • configuration defined in the yml files in the models folder (see the sketch below)
  • used with the source macro, which resolves the name to the right schema and builds the dependencies automatically
  • source freshness can be defined and tested
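A minimal sketch of that configuration (the source name, project, and dataset are placeholders):

```yaml
# models/staging/schema.yml
version: 2

sources:
  - name: staging
    database: my-gcp-project   # BigQuery project id (placeholder)
    schema: trips_data_all     # BigQuery dataset (placeholder)
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
```

and in a model:

```sql
select * from {{ source('staging', 'green_tripdata') }}
```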
Seeds
  • CSV files stored in our repo under the seeds folder
  • benefits of version controlling
  • equivalent to a copy command
  • recommended for data that does not change frequently (and is not huge)
  • runs with `dbt seed -s file_name`
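For example, a small lookup file at seeds/taxi_zone_lookup.csv (a hypothetical seed, first rows only):

```csv
locationid,borough,zone
1,EWR,Newark Airport
2,Queens,Jamaica Bay
3,Bronx,Allerton/Pelham Gardens
```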
Ref
  • macro to reference the underlying tables and views that we are building in the data warehouse
  • run the same code in any environment, it will resolve the correct schema for us
  • dependencies are built automatically
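For example, building on the hypothetical staging model from above:

```sql
-- dbt infers the dependency on stg_green_tripdata and
-- resolves its schema per environment (dev vs prod)
select *
from {{ ref('stg_green_tripdata') }}
```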
Now onto creating my 1st model

In the models folder, create a new folder and in it a .yml file

The names in this yml file come from my BigQuery dataset

By clicking on 'Generate model' above `- name: green_tripdata`, we get:
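The generated file looks roughly like this (simplified; the real one spells out every column):

```sql
with raw_green_tripdata as (
    select * from {{ source('staging', 'green_tripdata') }}
)

select *
from raw_green_tripdata
```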

To run this, we use `dbt build`, and dbt executes the models in dependency order. Then the below is seen (ignoring the failed test):


Next, I learned about Macros
  • use control structures (e.g. if statements and for loops) in SQL
  • use env vars in our dbt project for production deployments
  • operate on the results of one query to generate another query
  • abstract snippets of SQL into reusable macros - these are analogous to functions in most programming languages
Creating a macro (the syntax reminds me of Excel VBA from high school)
Then we can use it like a normal function in our models:
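For example, a macro that maps payment type codes to descriptions (a hypothetical one, along the lines of the course material):

```sql
-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        else 'Unknown'
    end
{% endmacro %}
```

and in a model:

```sql
select
    payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}
```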

Then, I learned about packages
  • like libraries in other programming languages
  • standalone dbt projects, with models and macros that tackle a specific problem area
  • by adding a package to a project, the package's models and macros will become part of that project
  • declared in the packages.yml file and installed by running `dbt deps`
We can add packages to the packages.yml file, which lives at the same level as our dbt_project.yml.
Then install them with `dbt deps`.
Then we can use functions from these packages in our data models.
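For example, with dbt_utils (the version pin is just an example):

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```

After `dbt deps`, its macros can be called in models, e.g. to build a surrogate key:

```sql
select
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    *
from {{ source('staging', 'green_tripdata') }}
```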

Once we run the models, we can see their results in BigQuery as well:

Then, I learned about variables
  • useful for defining values that should be used across the project
  • with a macro, dbt allows us to provide data to models for compilation
  • to use a variable, we use the {{ var('...') }} function
  • variables can be defined in two ways:
    • in the dbt_project.yml, or
    • in the command line
Example:
When we run `dbt build` on this model, the var `is_test_run` defaults to true and a `limit 100` is added to the query; setting it to false removes the limit.
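A sketch of how this looks at the end of the model:

```sql
-- is_test_run defaults to true, so dev runs only process 100 rows
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
```

For a full run we override it on the command line: `dbt build --select this_model --vars '{"is_test_run": false}'`.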

Next, after adding a model for the yellow_tripdata, the project's data lineage looks like:
After adding seed data and creating a model for it as well, we can combine everything and get something that looks a bit more modular.
Here is an example error that shows some details
After fixing ~ Nice

I am finishing up now ~ and this is how the folder structure looks
and I can push directly to my GitHub, which is awesome.
Everything is on my repo.

This is not all of Module 4 - hopefully I will cover the rest tomorrow ^^


That is all for today!

See you tomorrow :)
