[Day 204] Transaction data EDA + MLflow & minIO docker setup

 Hello :)
Today is Day 204!


A quick summary of today:
  • some EDA on the KB AI competition data
  • setting up mlflow and minIO


First, some basic cleaning and EDA on my part of the data for the KB Kookmin Bank project.

These are the variables assigned to me: trans_date_trans_time, cc_num, merchant, category, amt, first, last, gender, street, city, state


There are no missing/null values in these columns.
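A check like this can be sketched with only the standard library. This is a minimal sketch, not the code I actually ran: the sample rows, the merchant names, and the list of "missing" tokens are all made up for illustration, and the sample deliberately contains two gaps so the counter has something to find.

```python
import csv
import io

# Hypothetical sample rows (not real data); two cells are left empty on purpose.
SAMPLE_CSV = """trans_date_trans_time,cc_num,merchant,category,amt,state
2019-01-01 00:00:18,1111,fraud_Shop A,grocery_pos,4.97,NC
2019-01-01 00:00:44,2222,fraud_Shop B,,107.23,WA
2019-01-01 00:00:51,3333,fraud_Shop C,misc_net,,ID
"""

# Strings treated as missing (an assumption; adjust to the real export).
MISSING_TOKENS = {"", "na", "nan", "null", "none"}


def count_missing(rows, columns):
    """Return {column: number of rows where the value is missing}."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[col] += 1
    return counts


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
missing = count_missing(reader, reader.fieldnames)
print(missing)
```

On the real file, every count coming back as 0 is what "no missing/null values" means here.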

Box plot of log(amount) in Not Fraud vs Fraud
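The numbers behind a box plot like this can be sketched as below. It is only an illustration under assumptions: the (amount, label) pairs are invented, and `is_fraud` as a 0/1 label is an assumed name, since the label column is not one of my assigned variables.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical (amt, is_fraud) pairs; `is_fraud` is an assumed label name.
transactions = [
    (4.97, 0), (107.23, 0), (41.96, 0), (9.50, 0),
    (281.06, 1), (843.91, 1), (19.20, 1),
]


def log_amount_summary(pairs):
    """Group log(amount) by label and return n/min/median/max per group."""
    groups = defaultdict(list)
    for amount, label in pairs:
        if amount > 0:  # log is undefined for zero or negative amounts
            groups[label].append(math.log(amount))
    return {
        label: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for label, values in groups.items()
    }


summary = log_amount_summary(transactions)
for label, stats in sorted(summary.items()):
    name = "Fraud" if label else "Not Fraud"
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

Taking the log first is what makes the two boxes comparable, since raw amounts are heavily right-skewed.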

Distribution of Not Fraud vs Fraud


Other graphs






Interesting ~ the state of Delaware has only Fraud transactions (of course this is not real data, but interesting nonetheless)
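One way to spot a state like that is to compute the fraud share per state and keep the ones at 100%. A minimal sketch, assuming a 0/1 `is_fraud` label and using made-up (state, label) pairs rather than the competition data:

```python
from collections import Counter

# Hypothetical (state, is_fraud) pairs, for illustration only.
records = [
    ("NY", 0), ("NY", 0), ("NY", 1),
    ("TX", 0), ("TX", 0),
    ("DE", 1), ("DE", 1),
]


def fraud_share_by_state(pairs):
    """Return {state: fraction of that state's transactions labelled fraud}."""
    totals = Counter()
    frauds = Counter()
    for state, label in pairs:
        totals[state] += 1
        frauds[state] += int(label)
    return {state: frauds[state] / totals[state] for state in totals}


shares = fraud_share_by_state(records)
# States where every transaction is labelled fraud.
only_fraud = sorted(s for s, share in shares.items() if share == 1.0)
print(only_fraud)
```

It is worth looking at the transaction count next to the share, because a state with only a handful of rows can hit 100% easily.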

We should also do some basic transformation and cleanup before the raw data goes into the db. One example from my columns: every merchant name starts with fraud_, so removing that prefix would be fine, just for a bit more clarity.
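The prefix cleanup could look something like this. It is a small sketch, with `clean_merchant` as a hypothetical helper name and invented merchant names; it strips the prefix only once and only when it is at the start.

```python
PREFIX = "fraud_"


def clean_merchant(name: str) -> str:
    """Strip surrounding whitespace and one leading 'fraud_' prefix, if present."""
    name = name.strip()
    if name.startswith(PREFIX):
        return name[len(PREFIX):]
    return name


# Hypothetical merchant names, for illustration only.
raw_names = ["fraud_Shop A", " fraud_Shop B ", "Shop C"]
cleaned = [clean_merchant(n) for n in raw_names]
print(cleaned)
```

Using a plain `str.replace("fraud_", "")` would also remove the text if it appeared in the middle of a name, which is why the sketch checks the start of the string instead.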


On another note ~ mlflow and minIO

I found this website (in Korean, but it can be translated) that provides an easy "plug-n-play" Dockerfile and docker-compose services code for setting up mlflow with a backend store, and minIO as the artifact store. I have never used minIO before, but from a quick online search (before using it) its UI looked similar to any cloud provider's storage. It is object storage that is compatible with the AWS S3 API, so it seems that it is scalable too.

Following the guide, I set up the Dockerfile and these three services:
  • mlflow-backend-store
  • mlflow-artifact-store
  • mlflow-server

I ran it and it works fine. This was the first time I saw the minIO UI.

It is empty for now, but it reminds me of AWS S3 and GCS.

The setup has also already created a bucket.
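To actually log something into that bucket, a client has to know where both the mlflow server and minIO are. Below is a rough sketch of how that might look. The URLs, ports, credentials, and experiment name are all placeholders I made up, not the values from the guide, and `minio_env` and `log_demo_run` are hypothetical helper names. I believe the client also needs boto3 installed for S3-style artifact uploads.

```python
import os


def minio_env(endpoint_url, access_key, secret_key):
    """Environment variables mlflow reads to reach an S3-compatible store."""
    return {
        "MLFLOW_S3_ENDPOINT_URL": endpoint_url,
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }


def log_demo_run(tracking_uri, env):
    """Log one tiny run; needs the mlflow package and running containers."""
    import mlflow  # third-party, imported lazily so the sketch loads without it

    os.environ.update(env)
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("kb-fraud-eda")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("note", "smoke test")
        mlflow.log_metric("dummy_metric", 1.0)
        # The artifact is what should show up inside the minIO bucket.
        mlflow.log_text("hello from the client", "hello.txt")


# Placeholder values (assumptions): replace with the ones in docker-compose.
env = minio_env("http://localhost:9000", "minio-user", "minio-password")
# log_demo_run("http://localhost:5000", env)  # uncomment once the stack is up
```

If the run works, the params and metrics go to the backend store and hello.txt should appear under the bucket in the minIO UI.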



Another thing, related to Kafka: for some reason the setup I had on this project was using a schema-registry image that was ~1.5GB and taking up a lot of my space in Docker, and it stopped working. So I replaced the services in my docker-compose file with the ones from my transaction-stream-data-pipeline project. At the moment my Docker containers are:



That is all for today!

See you tomorrow :)
