[Day 170] Uber data engineering project using GCP and Mage

 Hello :)
Today is Day 170!


A quick summary of today:
  • decided to give mage.ai another go and now use it alongside GCP


Well... I obviously did not suffer enough with the countless problems I experienced when first learning about Mage in DataTalksClub's MLOps zoomcamp, so I decided to do a cool-looking data engineering project [youtube]. The caveat is that I will use GCP, and I just hope I do not incur any major costs. I checked, and I have 21 days and plenty of free credits left. 

Getting to the project

It uses the infamous NYC taxi dataset. 

Using Lucid I learned a bit about dimensional data modelling (splitting the raw data into a central fact table surrounded by dimension tables).

Then, using Python, I did some basic preprocessing on the raw data to convert it into the 8 tables from the model above (a sketch of the idea is below). I put everything so far on my GitHub, and plan to write a nice README documentation once everything is finished. I started it after work today but did not finish, because of (again) Mage problems. 
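To give a rough idea of what the preprocessing looks like, here is a minimal sketch, assuming the standard NYC yellow taxi column names and a local CSV (the full version lives in the notebook in my repo):

```python
import pandas as pd

# Load the raw trip data (standard NYC yellow taxi columns assumed).
df = pd.read_csv(
    "uber_data.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
df = df.drop_duplicates().reset_index(drop=True)
df["trip_id"] = df.index

# One of the dimension tables: a datetime dimension derived from
# the pickup timestamp, with its own surrogate key.
datetime_dim = df[["tpep_pickup_datetime"]].reset_index(drop=True)
datetime_dim["pick_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pick_weekday"] = datetime_dim["tpep_pickup_datetime"].dt.weekday
datetime_dim["datetime_id"] = datetime_dim.index

# The fact table keeps the measures and references each dimension
# by its surrogate key (only one dimension shown here).
fact_table = df.merge(datetime_dim, left_on="trip_id", right_on="datetime_id")
```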

Before I get to the Mage problems, a bit about GCP.

I set up a Cloud Storage bucket (GCP's equivalent of S3 in AWS) and uploaded the raw data.
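For reference, the upload can also be done programmatically with the google-cloud-storage client (a quick sketch; the bucket and file names here are made up, and it assumes the credentials are already set up):

```python
from google.cloud import storage

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service
# account key with Storage permissions.
client = storage.Client()
bucket = client.bucket("uber-de-project-bucket")  # hypothetical bucket name
blob = bucket.blob("raw/uber_data.csv")
blob.upload_from_filename("uber_data.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```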

Then I set up a VM with the right access permissions, which gave me a nice SSH-in-browser terminal to install Python and run Mage. There are a lot of IP addresses on screen, so I am afraid to share pictures today. 

Using that VM, I installed Python, pip, and Mage, and started a project.

I added a data loader block to import the data, and a transformer block that basically follows the Jupyter notebook in my repo. However, whenever I ran the transformer block that produces the above 8 tables, the kernel restarted and never seemed to come back up. Restarting Mage through the terminal did not help; the same thing kept happening.

It turns out Mage bugs out when I convert the 8 dataframes to dicts; when I just return the 8 dataframes as a tuple, it is all fine.
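For anyone hitting the same thing, the fixed transformer is shaped roughly like this (a minimal sketch following Mage's standard block template; only one dimension table is shown, and the real block returns all 8 tables):

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df: pd.DataFrame, *args, **kwargs):
    # Build the dimension tables (only one shown; the real block
    # builds all 7 dimension tables plus the fact table).
    datetime_dim = df[["tpep_pickup_datetime"]].reset_index(drop=True)
    datetime_dim["datetime_id"] = datetime_dim.index

    fact_table = df.merge(datetime_dim, left_index=True, right_on="datetime_id")

    # Returning the dataframes directly as a tuple works fine;
    # calling .to_dict() on each of them is what killed the kernel.
    return fact_table, datetime_dim
```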

Then I created a user and credentials so I could connect to BigQuery and push data to it.
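The push itself is simple once the credentials are sorted. A sketch with the google-cloud-bigquery client (in Mage this would normally be a data exporter block reading from io_config.yaml; the project/dataset names and the stand-in dataframe here are made up):

```python
import pandas as pd
from google.cloud import bigquery

# The client picks up GOOGLE_APPLICATION_CREDENTIALS automatically.
client = bigquery.Client()

# Stand-in dataframe; in the pipeline this is one of the 8 tables.
fact_table = pd.DataFrame({"trip_id": [0, 1], "fare_amount": [9.5, 14.0]})

table_id = "my-project.uber_dataset.fact_table"  # hypothetical names
job = client.load_table_from_dataframe(fact_table, table_id)
job.result()  # block until the load job finishes
```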

The final Mage pipeline looks like this:

And ~

yay~~~

This took a bit of debugging as well, due to missing packages and getting the access keys and secrets in the right place and format. 
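In case it saves someone time, the "right place" part boils down to pointing the Google clients at the service account key file (a hedged note; the path below is made up, and Mage itself reads the key path from its io_config.yaml instead):

```python
import os

# Tell the Google client libraries where the service account
# JSON key lives (hypothetical path).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/me/keys/gcp-key.json"
```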

FYI, this is the project stack: the raw data goes into GCP Cloud Storage, a Compute Engine VM runs the Mage pipeline (Python), the resulting tables land in BigQuery, and Looker sits on top for the dashboard.

So, for the final step ~ using Looker for some kind of visualisation dashboard.
This is the final dashboard (unsurprisingly, Looker is very similar to Power BI). It looks awesome. I can't wait to apply what I learned to data of my own choosing.

Lastly, I deleted everything from GCP, but now I know exactly what I need to set up next time. ^^


Just a quick mention - today I kept reviewing the intro I wrote yesterday, and added a table which I can share. 

I have shared all of these papers' summaries/notes throughout the last few weeks, and I was taught that tables like this (and even more detailed ones) are a good idea in a paper.


That is all for today!

See you tomorrow :)
