[Day 208] Setting up docker-services for the KB project, streaming transactions, and the Scottish dataset

 Hello :)
Today is Day 208!


A quick summary of today:
  • making progress with the Kukmin Bank (KB) project
  • Scottish dataset 2hr audio mark is close


First, about the KB project for the KB AI competition

All code from today is on the project's repo on branch ivan.

Today I managed to set up the (what I think are the) needed services going forward.

I hit the Docker storage space limit a few times, so I had to compromise on complexixty. For example instead of using postgres as a db for mlflow, even using a simple sqlite3 db for this project is fine. Later, plugging in a better storage for mlflow, and its artifacts is trivial.

One new thing, that was not there before is grafana. I saw online that there is a neo4j grafana plugin - so I tried to install it. At first, by default, I could not, so I looked in grafana cloud (free for 14 days) and on there, the plugin could easily be loaded, and I could easily connect to my neo4j db. However, I wanted to do it locally, so I found a way to install the plugin directly in my docker-compose grafana service definition, like: 


This long url leads to the neo4j plugin download which is automatically downloaded. So with this, I could connect to my docker hosted neo4j db. 

Next, I decided to try to set up the pseudo streaming script that will emulate transactions occurring. 

Every second, it sends 5 transactions to the kafka stream in the form of dictionaries. 
Its Dockerfile:

And requirements.txt

For reproducability, I will try to add version to every package being used, just to ensure that when the project code is in the hands of the judges, if they want, they can reproduce everything. 

Next, I will try to set up Mage (orchestration) a bit more so we have defined pipelines, and more importantly - working pipelines for training a GCN, and using that model for real-time streaming predictions. 

Second, about the Scottish dataset

I have two scripts, the audio of which I need to cut into clips. Along with these we will get to that 2 hour mark which we have been aiming for. Here is the current version, but I hope over the next week to preprocess the audio and get the full dataset on huggingface. 


That is all for today!

See you tomorrow :)

Popular posts from this blog

[Day 198] Transactions Data Streaming Pipeline Porject [v1 completed]

[미리 공부] 기초 통계 복습 (Day 1는 1월2일)

[Day 61] Stanford CS224N (NLP with DL): Machine translation, seq2seq + a side CDCGAN mini project