[Day 160] Simple data engineering pipeline with Prefect, and... MLOps with mage.ai (tons of problems)

 Hello :)
Today is Day 160!


A quick summary of today:
  • simple data engineering pipeline with Prefect
  • tons of trouble learning about orchestration with Mage.ai


After yesterday's journey with Prefect, the YouTube algorithm recommended another Prefect tutorial, this time on creating data pipelines. So I decided to give it a go.

What is data engineering?

  • data scientists can do data engineering, but in specific cases where the two jobs cannot be, or do not need to be, separate
  • data engineers build databases and lots of data pipelines, and manage infrastructure (they also care about cost and security)

What are data pipelines?
  • ETL(ELT)/batch pipelines that move data from A to B
    • databases, APIs, files
  • streaming pipelines - as data comes in, we consume that data and send it wherever it needs to go 
    • message queues, polled data
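
To make the two pipeline styles concrete, here is a small standard-library sketch. The record shape and all function names are made up for illustration; a real batch job would read from a database, API, or file, and a real streaming job from a message queue.

```python
from typing import Callable, Iterable, Iterator


def transform(record: dict) -> dict:
    """Shared transformation step: normalise the name field."""
    return {**record, "name": record["name"].strip().title()}


def run_batch(source: list, sink: list) -> None:
    """Batch (ETL): read everything from A, transform it, write it to B in one go."""
    sink.extend(transform(record) for record in source)


def run_streaming(events: Iterable[dict], handle: Callable[[dict], None]) -> int:
    """Streaming: consume each record as it arrives and forward it immediately."""
    count = 0
    for event in events:
        handle(transform(event))
        count += 1
    return count


def fake_queue() -> Iterator[dict]:
    """Hypothetical stand-in for a message queue that yields one record at a time."""
    yield {"name": " rex "}
    yield {"name": "MILO"}
```

The transformation is the same in both cases; what differs is whether the data is processed as one finished set or record by record as it comes in.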
The main GitHub repo used is here.

After some basic setup, running 'pipeline' in the terminal executes the main.py file.

In Prefect we get a flow run, along with logged outputs from the petstore URL.

Now for a flow with a few more tasks.

1st task: retrieve from API

2nd task: clean data

3rd task: insert to postgres db

final flow:

Success!
We can also see the data loaded into Postgres.
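
A three-task flow of the shape described above could be sketched roughly as follows. This is not the tutorial's code: the function names, the `pets` table, and the cleaning rule are hypothetical, the API URL is left to the caller, and sqlite3 stands in for Postgres so the sketch needs no running database. If Prefect is not installed, the decorators fall back to pass-through versions.

```python
import json
import sqlite3
import urllib.request

try:
    from prefect import flow, task
except ImportError:
    # Assumption: Prefect may not be installed where this is read, so fall
    # back to pass-through decorators and run everything as plain Python.
    def task(fn):
        return fn

    def flow(fn):
        return fn


def fetch_json(url: str) -> list:
    """1st step: retrieve a JSON list of records from the API."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def clean_records(raw: list) -> list:
    """2nd step: keep only records that have an id and a non-empty name."""
    cleaned = []
    for item in raw:
        name = str(item.get("name") or "").strip()
        if item.get("id") is not None and name:
            cleaned.append({"id": int(item["id"]), "name": name})
    return cleaned


def insert_rows(rows: list, db_path: str) -> int:
    """3rd step: insert rows and return the table's row count.

    sqlite3 is a stand-in for Postgres here (an assumption of this sketch).
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pets (id INTEGER PRIMARY KEY, name TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO pets (id, name) VALUES (:id, :name)", rows
            )
        return conn.execute("SELECT COUNT(*) FROM pets").fetchone()[0]
    finally:
        conn.close()


# Wrap the plain functions as tasks so the logic stays testable on its own.
retrieve_task = task(fetch_json)
clean_task = task(clean_records)
insert_task = task(insert_rows)


@flow
def pet_pipeline(url: str, db_path: str) -> int:
    """Final flow: retrieve, clean, insert."""
    raw = retrieve_task(url)
    rows = clean_task(raw)
    return insert_task(rows, db_path)
```

Calling `pet_pipeline(url, "pets.db")` with a real endpoint would run the three steps in order; keeping the logic in plain functions means each step can be checked without starting a flow run.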

Beege (the teacher) also showed how to write simple task tests and how to avoid a common Prefect error that occurs when executing Prefect task functions outside a Prefect flow.
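
In Prefect 2.x, calling a task-decorated function directly outside a flow raises an error. One common way around this in tests is to call the undecorated function through the task's `.fn` attribute. The sketch below uses a hypothetical `clean_pet` task, not the one from the tutorial; the fallback decorator only exists so the example still runs where Prefect is not installed.

```python
try:
    from prefect import task
except ImportError:
    # Assumption: Prefect is not installed here. This stand-in only mimics
    # the `.fn` attribute that a real Prefect task exposes.
    def task(fn):
        fn.fn = fn
        return fn


@task
def clean_pet(pet: dict) -> dict:
    """Hypothetical task: trim the pet's name and default a missing status."""
    return {
        "name": pet["name"].strip(),
        "status": pet.get("status", "unknown"),
    }


def test_clean_pet() -> None:
    # Call the undecorated function through `.fn`, so no flow run is needed.
    result = clean_pet.fn({"name": "  Rex "})
    assert result == {"name": "Rex", "status": "unknown"}
```

A test runner such as pytest would pick up `test_clean_pet` by its name.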



Prefect is amazing, but the MLOps zoomcamp course works with Mage.ai this year. I gave up on Mage last time because of errors I could not fix, so I decided to give it another go today. And... omg, so many errors and problems. I almost gave up on it several times. The videos were recorded 2 weeks ago, and yet so much is different on my UI (and other students' UIs) that it is mind-boggling.

There was not much content, but the constant errors meant I spent upwards of 12 hours on bug fixing for a total of 30 minutes of video.


Orchestration with Mage.ai from MLOps zoomcamp module 3

The content of the intro to orchestration is:

I have to do 3.5 tomorrow (hopefully). 
I just want to say a huge thank you to the QnA bot on the MLOps zoomcamp Slack channel, which at least pointed me in the right direction to solve my issues. Just quickly: I am a bit sad that there are so many issues (in almost every video), because it will drive people away from this amazing course. Below I will only cover my successes; if you are really curious about my problems, they are all on the DataTalks Slack channel haha.

I started from the beginning

3.1 Data preparation

I created the above pipeline that reads NY taxi data, does a bit of preprocessing, and then outputs X, X_train, X_val, y, y_train, y_val, dv. By the way, each block is a piece of code.
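
To give an idea of what "a piece of code" means for a single block, here is a rough sketch of a transformer-style block. Everything specific in it is an assumption: the `pickup` and `dropoff` field names, the 1 to 60 minute filter, and the use of plain dicts (a block would more typically receive a pandas DataFrame). The fallback decorator is only there so the sketch runs without Mage installed.

```python
from datetime import datetime

try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    # Assumption: Mage is not installed here, so use a pass-through decorator.
    def transformer(fn):
        return fn


@transformer
def transform(rows: list, *args, **kwargs) -> list:
    """Add trip duration in minutes and keep trips lasting 1 to 60 minutes."""
    kept = []
    for row in rows:
        # Hypothetical field names; the real taxi data uses its own columns.
        minutes = (row["dropoff"] - row["pickup"]).total_seconds() / 60
        if 1 <= minutes <= 60:
            kept.append({**row, "duration": minutes})
    return kept
```

Each block takes the output of the block before it as input, so a pipeline is a chain of small functions like this one.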

3.2 Training

First, I created a pipeline to train a linear regression model and a lasso model. It takes data from the 3.1 pipeline above, loads the models, does a hyperparameter search, and finally trains the two models.

Secondly, I had to create a pipeline for an XGBoost model.
The last (pink) bit is connected to creating visualisations. 

3.3 Observability

We can create any kind of bar/flow/line/custom charts, and the above 3 are some SHAP values from the xgboost model. 

3.4 Triggering

Here I created an automatic trigger to train the xgboost model when new data is detected.

I also created this predict pipeline.

There is a nice way to set up a basic interface for inference.

We can also set up an API and do inference through that.
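
Calling such an API from Python could look roughly like this. It assumes the API is a Mage API trigger that accepts a JSON body with runtime variables nested under `pipeline_run.variables`; the `inputs` variable name and the feature fields are hypothetical, and the trigger URL (which includes a token) would have to be copied from the trigger's page in the Mage UI.

```python
import json
import urllib.request


def build_trigger_request(trigger_url: str, inputs: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for a pipeline API trigger."""
    # Assumed payload shape: runtime variables under pipeline_run.variables.
    payload = {"pipeline_run": {"variables": {"inputs": inputs}}}
    return urllib.request.Request(
        trigger_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect before anything hits the server, for example `send(build_trigger_request(url, [{"trip_distance": 1.98}]))` once the URL is known.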

That was all for today. I have very strong feelings towards Mage.ai but for now I will roll with it because of the nice course. Otherwise I will try to use prefect for some side project (have to think about that a bit). 


That is all for today!

See you tomorrow :)
