[Day 160] Simple data engineering pipeline with Prefect, and... MLOps with mage.ai (tons of problems)

6/09/2024 10:41:00 pm

Hello :)
Today is Day 160!

A quick summary of today:
simple data engineering pipeline with prefect
tons of trouble learning about orchestration with mage.ai

After yesterday's journey with Prefect, the YouTube algorithm recommended me another Prefect tutorial, this time on creating data pipelines. So I decided to give it a go.

What is data engineering?

data scientists can do data engineering too, but mostly in cases where the two jobs cannot be, or do not need to be, separate
data engineers build databases, they build lots of data pipelines and manage infrastructure (also care about cost, security)

What are data pipelines?

ETL(ELT)/batch pipelines that move data from A to B

databases, APIs, files

streaming pipelines - as data comes in, we consume that data and send it wherever it needs to go

message queues, polled data
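To make the batch vs streaming distinction concrete, here is a minimal sketch using only the Python standard library. Everything in it is my own illustration, not from the tutorial: the `people` table, the `run_batch` and `run_stream` names, and the `None` sentinel that ends the stream are all assumptions.

```python
import queue
import sqlite3


def run_batch(source_rows, conn):
    """Batch ETL: extract all rows, transform them, then load them in one go."""
    # Transform: drop rows without a name, tidy up the rest.
    cleaned = [
        (row["id"], row["name"].strip().title())
        for row in source_rows
        if row.get("name")
    ]
    # Load: write everything to the destination ("B") in a single batch.
    conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT OR REPLACE INTO people VALUES (?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


def run_stream(messages, handle):
    """Streaming: consume each message as it arrives, until a None sentinel."""
    processed = 0
    while True:
        message = messages.get()  # blocks until something is on the queue
        if message is None:       # hypothetical "no more data" marker
            break
        handle(message)           # send it wherever it needs to go
        processed += 1
    return processed
```

The batch function sees the whole dataset at once, while the streaming function only ever sees one message at a time, which is the main design difference between the two pipeline types.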

The main github repo used is here.

After some basic setup, we run 'pipeline' in the terminal, which executes the main.py file:

In prefect we get a flow run

and logged outputs from the petstore url

Now for a flow with a few more tasks~

1st task: retrieve from API

2nd task: clean data

3rd task: insert to postgres db

final flow:

Success!
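For reference, here is a rough sketch of what such a three-task flow can look like. It is not the tutorial's actual code. Assumptions on my side: the URL is the public Swagger Petstore demo endpoint, SQLite stands in for Postgres so the sketch needs no database server, and all function, table and flow names are made up. The `try/except` fallback only exists so the sketch still imports where Prefect is not installed.

```python
import json
import sqlite3
from urllib.request import urlopen

try:
    from prefect import flow, task
except ImportError:
    # Fallback (assumption): no-op decorators when Prefect is missing.
    def _passthrough(fn=None, **_kwargs):
        return fn if fn is not None else (lambda f: f)

    flow = task = _passthrough

# Assumed endpoint: the public Swagger Petstore demo API.
PETSTORE_URL = "https://petstore.swagger.io/v2/pet/findByStatus?status=available"


@task(retries=2, retry_delay_seconds=5)
def retrieve_from_api(url: str = PETSTORE_URL) -> list:
    """1st task: pull raw JSON records from the API."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


@task
def clean_data(raw: list) -> list:
    """2nd task: keep (id, name, status), dropping records without id or name."""
    rows = []
    for pet in raw:
        name = (pet.get("name") or "").strip()
        if pet.get("id") is None or not name:
            continue
        rows.append((int(pet["id"]), name, pet.get("status") or "unknown"))
    return rows


@task
def insert_to_db(rows: list, db_path: str = "pets.db") -> int:
    """3rd task: upsert the rows (SQLite here, Postgres in the tutorial)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pets (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
        )
        conn.executemany("INSERT OR REPLACE INTO pets VALUES (?, ?, ?)", rows)
    return len(rows)


@flow(name="petstore-etl", log_prints=True)
def pet_pipeline() -> None:
    """Final flow: chain the three tasks."""
    raw = retrieve_from_api()
    rows = clean_data(raw)
    count = insert_to_db(rows)
    print(f"Loaded {count} rows")
```

Calling `pet_pipeline()` should then show up as a single flow run with three task runs in the Prefect UI.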

We can also see the data loaded into postgres:

Beege (the teacher) also showed how to write simple task tests and how to avoid a common Prefect error that occurs when executing Prefect task functions outside a Prefect flow.
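I am assuming the error in question is the Prefect 2.x one that complains about tasks being run outside of a flow. A common way around it in tests is to call the original, undecorated function through the task's `.fn` attribute. A minimal sketch, where the `clean_name` task is hypothetical and the fallback only mimics `.fn` when Prefect is not installed:

```python
try:
    from prefect import task
except ImportError:
    # Fallback (assumption): mimic the Task.fn attribute without Prefect.
    def task(fn):
        fn.fn = fn
        return fn


@task
def clean_name(raw: str) -> str:
    """Trim whitespace and title-case a pet name."""
    return raw.strip().title()


def test_clean_name():
    # .fn is the plain Python function, so no flow run is needed here.
    assert clean_name.fn("  rex ") == "Rex"
```

This keeps unit tests fast, because they exercise only the task's logic and never touch the Prefect engine.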

Prefect is amazing, but the MLOps zoomcamp course works with Mage.ai this year. Last time I gave up on it due to errors that I could not fix, so I decided to give it another go today. And... omg so many errors and problems - I almost gave up on it several times. The videos were recorded 2 weeks ago, and yet so much is different on my UI (and other students' UI) that it is mind-boggling.

The content was not much, but the constant errors... made me invest upwards of 12 hours of bug fixing for a total of 30 minutes of video content.

Orchestration with Mage.ai from MLOps zoomcamp module 3

The content of the intro to orchestration is:

I have to do 3.5 tomorrow (hopefully).
I just want to say a huge thank you to the QnA bot on the mlops zoomcamp slack channel that at least pointed me in the right direction to solve my issues. Just quickly - I am a bit sad that there are so many issues (in almost every video) because it will drive people away from this amazing course. Below I will just share my successes, and if you are really curious about my problems - they are all on the datatalks slack channel haha.

I started from the beginning

3.1 Data preparation

I created the above pipeline that reads NY taxi data, does a bit of preprocessing, and then outputs X, X_train, X_val, y, y_train, y_val, dv. By the way, each block is a piece of code.
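To give an idea of what the preprocessing produces, here is a rough sketch in plain scikit-learn, outside of Mage. This is not the course code: the `prepare` function, the random train/validation split and the `duration` target name are my assumptions, and I take `dv` to be a scikit-learn `DictVectorizer`.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split


def prepare(trips, target="duration", val_size=0.2, seed=42):
    """Turn a list of trip dicts into X, X_train, X_val, y, y_train, y_val, dv."""
    # Separate the features from the target column.
    feature_dicts = [{k: v for k, v in trip.items() if k != target} for trip in trips]
    y = [trip[target] for trip in trips]

    # Assumption: a simple random split into train and validation sets.
    dicts_train, dicts_val, y_train, y_val = train_test_split(
        feature_dicts, y, test_size=val_size, random_state=seed
    )

    # Fit the vectorizer on the training part only, then reuse it everywhere.
    dv = DictVectorizer()
    X_train = dv.fit_transform(dicts_train)
    X_val = dv.transform(dicts_val)
    X = dv.transform(feature_dicts)
    return X, X_train, X_val, y, y_train, y_val, dv
```

Returning `dv` together with the matrices matters because the same fitted vectorizer has to be applied to new data at prediction time.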

3.2 Training

First, create a pipeline to train a linear regression and lasso model. It takes data from the above 3.1 pipeline, loads models, does hparam search, and finally trains the two models.
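The training step could look roughly like the sketch below. It is my own illustration in plain scikit-learn, not the Mage blocks from the course: `GridSearchCV` is swapped in as a simple stand-in for whatever hyperparameter search the course uses, and the `train_models` name and the alpha grid are assumptions.

```python
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV


def train_models(X_train, y_train, X_val, y_val):
    """Train linear regression and a tuned lasso, return {name: (model, val RMSE)}."""
    # Plain linear regression has nothing to tune here.
    linreg = LinearRegression().fit(X_train, y_train)

    # Hyperparameter search over the lasso regularisation strength (assumed grid).
    search = GridSearchCV(
        Lasso(max_iter=10_000),
        param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
        scoring="neg_root_mean_squared_error",
        cv=3,
    )
    search.fit(X_train, y_train)  # refits the best model on all training data

    results = {}
    for name, model in {"linear_regression": linreg, "lasso": search.best_estimator_}.items():
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        results[name] = (model, rmse)
    return results
```

Both models are scored on the same validation set, so the two RMSE values can be compared directly.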

Secondly, I had to create a pipeline for an xgboost model

The last (pink) bit is connected to creating visualisations.

3.3 Observability

We can create any kind of bar/flow/line/custom charts, and the above 3 are some SHAP values from the xgboost model.

3.4 Triggering

Here I created an automatic trigger to train the xgboost model when new data is detected.

I also created this predict pipeline

There is a nice way to set up a basic interface for inference

We can also set up an API and do inference through that as well:

That was all for today. I have very strong feelings towards Mage.ai but for now I will roll with it because of the nice course. Otherwise I will try to use prefect for some side project (have to think about that a bit).

That is all for today!

See you tomorrow :)
