[Day 170] Uber data engineering project using GCP and Mage

 Hello :)
Today is Day 170!


A quick summary of today:
  • decided to give mage.ai another go and now use it alongside GCP


Well... I obviously did not suffer enough with the countless problems I experienced when I was first learning about mage from DataTalksClub's MLOps zoomcamp, so I decided to do a cool-looking data engineering project [youtube]. The caveat is that I will use GCP. And I just hope I do not incur any major costs. I checked, and I have 21 days left on plenty of free credits. 

Getting to the project

It uses the infamous NYC taxi dataset. 

Using lucid, I learned a bit about dimensional data modelling.

Then, using python, I did some basic preprocessing on the raw data to convert it into the 8 tables of the data model. I actually put everything so far on my GitHub, and plan on writing a nice readme once everything is finished. Even though I started it after work today, I did not finish it because of, *again*, mage problems.
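To give a rough idea of what that preprocessing looks like, here is a minimal sketch of building one dimension table and a fact table with pandas. The tiny sample frame and the column and table names (`tpep_pickup_datetime`, `datetime_dim`, `fact_table`, and so on) are assumptions for illustration, not the exact ones from the project.

```python
import pandas as pd

# Tiny made-up sample standing in for the raw trip data (column names are assumptions).
raw = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2016-03-01 00:00:00",
        "2016-03-01 08:30:00",
        "2016-03-01 00:00:00",
    ],
    "payment_type": [1, 2, 1],
    "fare_amount": [9.0, 11.5, 7.0],
})
raw["tpep_pickup_datetime"] = pd.to_datetime(raw["tpep_pickup_datetime"])

# Dimension table: one row per distinct pickup timestamp, plus derived attributes.
datetime_dim = (
    raw[["tpep_pickup_datetime"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
datetime_dim["pick_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pick_weekday"] = datetime_dim["tpep_pickup_datetime"].dt.weekday
# Surrogate key taken from the row position after de-duplication.
datetime_dim["datetime_id"] = datetime_dim.index

# Fact table: keep the measures and replace the raw timestamp with the surrogate key.
fact_table = raw.merge(datetime_dim, on="tpep_pickup_datetime")[
    ["datetime_id", "payment_type", "fare_amount"]
]

print(datetime_dim)
print(fact_table)
```

The same pattern (de-duplicate, add an id, merge the id back) would repeat for each of the other dimension tables.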

Before I get to the mage problems, a bit about GCP.

I set up a Cloud Storage bucket (similar to S3 in AWS) and uploaded the raw data.

Then I set up a VM with the right access permissions. This gave me a nice SSH-in-browser terminal to install python and run mage. There are a lot of IP addresses, so I am afraid to share pictures today.

Using that VM, I installed python, pip and mage, and started a project.

I added a block to import the data, and a transformer block that basically follows the jupyter notebook in my repo. However, whenever I ran the transformer block that produces the 8 tables, the kernel restarted and seemed to never come back up. I restarted mage through the terminal, and the same thing happened.

Turns out mage bugged out when I was converting the 8 datasets to dicts; when I just return the 8 dataframes as a tuple, it is all fine.
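A minimal sketch of the difference, assuming the dict version was built with something like `DataFrame.to_dict()` (that detail, the function name `transform`, and the two small tables are assumptions). In mage the function would carry the `@transformer` decorator from the block template; it is left out here so the snippet runs on its own.

```python
import pandas as pd


def transform(df: pd.DataFrame):
    """Split one raw frame into two tables and return them as a tuple of DataFrames."""
    payment_dim = df[["payment_type"]].drop_duplicates().reset_index(drop=True)
    payment_dim["payment_type_id"] = payment_dim.index

    fact_table = df.merge(payment_dim, on="payment_type")[
        ["payment_type_id", "fare_amount"]
    ]

    # Return the frames directly. The variant that caused trouble would look
    # roughly like this (an assumption about the exact form):
    #   return {"payment_dim": payment_dim.to_dict(), "fact_table": fact_table.to_dict()}
    return payment_dim, fact_table


# Made-up sample rows, just to exercise the function.
sample = pd.DataFrame({"payment_type": [1, 2, 1], "fare_amount": [9.0, 11.5, 7.0]})
payment_dim, fact_table = transform(sample)
print(payment_dim)
print(fact_table)
```

With a tuple, a downstream block receives the dataframes as they are, so nothing has to be rebuilt from nested dicts.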

Then I created a user and credentials so I can connect to BigQuery and push data to it.

The final mage pipeline looks like this:

And ~

yay~~~

This took a bit of debugging as well, due to uninstalled packages and putting access keys and secrets in the right place and format.

FYI, this is the project stack:

So for the final step ~ using looker for some kind of visualisation dashboard.
This is the final dashboard (unsurprisingly, looker is very similar to Power BI). It looks awesome. Can't wait to apply what I learned to data of my choosing.

Lastly, I deleted everything from GCP, but I know what I need next time. ^^


Just a quick mention - today I kept reviewing my intro that I wrote yesterday, and added a table which I can share. 

All of these papers' summaries/notes I have shared throughout the last few weeks, and I was taught that tables like that (and even more detailed) are a good idea in a paper.


That is all for today!

See you tomorrow :)
