Posts

Showing posts from July, 2024

[Day 212] Final Glaswegian TTS model

Hello :) Today is Day 212! A quick summary of today: creating a simple Glaswegian assistant app. After fine-tuning whisper to get the final version of the Glaswegian ASR model, the next task was a final fine-tuning of the T5 speech model to get a final version of the Glaswegian TTS model. Well, I did that today. Here is a link to the model on HuggingFace, and its training results: Now that we have the final 2-hour dataset, I was hoping for better results. Before, the generated audio (while it had a little accent) sounded robotic. The first thing I had to do was fix the HuggingFace space where the previous version of the Glaswegian TTS was running. The issue was related to voice embeddings, and after a quick fix ~ it was up again, and I loaded the latest glaswegian_tts model. Well now, it *does* sound better. There are cases where it is robotic, but there is definite improvement compared to the previous version. That previous version was trained on around 30 mins of audio, compared to
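Since the Space fix came down to how the voice embeddings are passed in, here is a minimal sketch of what inference with a fine-tuned SpeechT5-style TTS model typically looks like. The model id and the speaker-embedding file are placeholders, not the actual project code:

```python
# Minimal sketch of TTS inference with a fine-tuned SpeechT5-style model.
# The model id and the speaker-embedding source are assumptions.
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

model_id = "your-username/glaswegian_tts"  # hypothetical repo id
processor = SpeechT5Processor.from_pretrained(model_id)
model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# SpeechT5 expects a 512-dim x-vector speaker embedding; here one saved to disk
speaker_embedding = torch.load("speaker_embedding.pt")  # shape: (1, 512)

inputs = processor(text="Hello from Glasgow!", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("output.wav", speech.numpy(), samplerate=16000)
```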

[Day 211] 2 hour mark !!! Glaswegian dataset goal - accomplished! + whisper-small fine-tuned

Hello :) Today is Day 211! A quick summary of today: fine-tuning the final whisper-small model for Glaswegian automatic speech recognition (ASR). After yesterday, the audio was at 118 minutes and I was not happy that we did not *officially* hit the 2-hour mark, so today I went on youtube and got a 3-minute clip from Limmy which I transcribed and added to our dataset. Here is a link to the final 2-hour Glaswegian dataset on HuggingFace. After a few months of working on this side project, we hit the set goal!~ After I got the dataset I started fine-tuning whisper-small. It took around 4 hours ~ and the fine-tuned model is ready and deployed for use on HuggingFace Spaces. Looking at the metrics, training could be improved. As for what is next - training a Text-To-Speech model using our final dataset. This was the hard part before, because the voice at the end was, while understandable, a bit robotic. So this is the next step and hopefully we get better results now. That is all for today! See y
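For reference, a minimal sketch of running the fine-tuned model on a held-out clip and checking word error rate (WER), the metric reported during training. The model id, clip file, and reference transcript are placeholders:

```python
# Transcribe a clip with the fine-tuned ASR model and compute WER.
from transformers import pipeline
import evaluate

asr = pipeline("automatic-speech-recognition",
               model="your-username/whisper-small-glaswegian",  # hypothetical repo id
               chunk_length_s=30)  # Whisper decodes 30-second windows

prediction = asr("limmy_clip.wav")["text"]
reference = "placeholder ground-truth transcript for the clip"

wer = evaluate.load("wer")
print("WER:", wer.compute(predictions=[prediction], references=[reference]))
```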

[Day 210] 118 minutes of Glaswegian accent audio clips

Hello :) Today is Day 210! A quick summary of today: final audio clip preprocessing to reach our audio dataset mark. Final dataset for the Glaswegian voice assistant AI ( link to HuggingFace ). Today I preprocessed the final audios from 2 of Limmy's youtube videos (Limmy accidentally kills the city and The writer of Saw called Limmy a ...). Just an update on how the process goes now ~ Since our transcription AI is pretty good (according to my Glaswegian-speaking project partner), we pass the full raw audio to our fine-tuned whisper model hosted on HuggingFace spaces. Then the transcript is put into a docs file (where I first check it over for obvious mistakes and flag anything odd that I cannot understand from re-listening to the audio) and split into sensible (small) bits while listening to the audio, like: (this is the start from Limmy accidentally kills the city) Then, using an audio tool, I cut the full audio into clips according to the cut text, then I match c
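To give an idea of the clip-cutting step, here is a small sketch assuming the split text already has rough start/end timestamps. File names, times, and transcript lines are made up for illustration:

```python
# Cut a full video's audio into dataset clips and pair each with its transcript.
from pydub import AudioSegment

audio = AudioSegment.from_file("limmy_full_video.mp3")

# (start_ms, end_ms, transcript) triples noted down while listening to the audio
segments = [
    (0, 4200, "right, so this actually happened to me"),
    (4200, 9800, "I was playing the game and everything went dark"),
]

for i, (start, end, text) in enumerate(segments):
    clip = audio[start:end]                      # pydub slices in milliseconds
    clip.export(f"clips/clip_{i:03d}.wav", format="wav")
    with open(f"clips/clip_{i:03d}.txt", "w") as f:
        f.write(text)                            # transcript paired with the clip
```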

[Day 209] Using Mage for pipeline orchestration in the KB project

Hello :) Today is Day 209! A quick summary of today: creating Mage pipelines for the KB AI competition project. The repo after today: Today morning/afternoon I went on a bit of a roll ~ I set up all the above pipelines in Mage. Below I will go over each one. get_kaggle_data: just one block that downloads the data for the project from Kaggle using the kaggle python package. load_batch_into_neo4j: gets the loaded data from the get_kaggle_data pipeline and inserts it into neo4j. This is the fraudTrain.csv from the Kaggle website (the fraudTest.csv will be used for the pseudo-streaming pipeline). train_gcn: I tried to split this into more blocks, but with the way I structured the code at the moment, the most practical solution was to do it all at once. That is - create a torch-geometric dataset, create the node and edge index, train the model, test it, and save summary info. At the moment, because everything is local, I am using mlflow just for easy comparison but later (after the project s
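As a rough sketch, the get_kaggle_data loader block could look something like the following inside Mage. The Kaggle dataset slug is a placeholder, and kaggle.json credentials are assumed to be configured:

```python
# Sketch of a Mage data-loader block that pulls the Kaggle csv for the project.
import pandas as pd
import kaggle  # requires kaggle.json credentials to be set up

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_kaggle_data(*args, **kwargs):
    # download and unzip the competition data into a local folder
    kaggle.api.dataset_download_files(
        'some-user/credit-card-fraud',   # hypothetical dataset slug
        path='data/', unzip=True)
    return pd.read_csv('data/fraudTrain.csv')
```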

[Day 208] Setting up docker-services for the KB project, streaming transactions, and the Scottish dataset

Hello :) Today is Day 208! A quick summary of today: making progress with the Kukmin Bank (KB) project; the Scottish dataset's 2hr audio mark is close. First, about the KB project for the KB AI competition. All code from today is on the project's repo on branch ivan . Today I managed to set up (what I think are) the needed services going forward. I hit the Docker storage space limit a few times, so I had to compromise on complexity. For example, instead of using postgres as a db for mlflow, even a simple sqlite3 db is fine for this project. Later, plugging in better storage for mlflow and its artifacts is trivial. One new thing that was not there before is grafana. I saw online that there is a neo4j grafana plugin - so I tried to install it. At first, by default, I could not, so I looked at grafana cloud (free for 14 days) and on there, the plugin could easily be loaded, and I could easily connect to my neo4j db. However, I wanted to do it locally, so I found a way to install
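Swapping postgres for sqlite on the tracking side is a one-line change; a minimal sketch (experiment name and logged values are just for illustration):

```python
# Point mlflow at a local sqlite backend instead of a postgres service.
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")   # file-based backend store
mlflow.set_experiment("kb-fraud-detection")      # hypothetical experiment name

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("model", "gcn")
    mlflow.log_metric("auc", 0.5)                # dummy value, just to verify logging
```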

[Day 207] Finished with neo4j (for now) and thinking about fraud detection models

Hello :) Today is Day 207! A quick summary of today: setting up neo4j and reading some papers on bank telemarketing classification. Firstly, about the Kukmin Bank (KB) AI competition project. We set up the database on my partner's laptop: insert CreditCard nodes, insert Merchant nodes, insert Transaction edges. We also had a look at the final EDA notebook. At the moment the repo looks like: As for next steps, we will start developing models and try to get something that can identify fraud transactions well. And more importantly, explainability - what features help the model determine that a transaction is fraud. So we will start with some basic models like logistic regression, decision trees, and random forest, then add hyperparameter tuning, over/undersampling, etc., and try to get a model that detects fraud well. I also want to try out a GNN for this project's model; I saw there is a GNNExplainer in torch-geometric, so I need to try using it in practice and see if the explainability it provides is go
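For a feel of the node/edge inserts, here is a rough sketch of the kind of Cypher used to load the transaction data into neo4j. Connection details and the property values are placeholders, not the project's actual script:

```python
# Insert a CreditCard node, a Merchant node, and a Transaction edge into neo4j.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

insert_tx = """
MERGE (c:CreditCard {cc_num: $cc_num})
MERGE (m:Merchant {name: $merchant})
CREATE (c)-[:TRANSACTION {amt: $amt, trans_num: $trans_num, is_fraud: $is_fraud}]->(m)
"""

with driver.session() as session:
    session.run(insert_tx, cc_num="1234", merchant="Kirlin and Sons",
                amt=4.97, trans_num="abc123", is_fraud=0)
driver.close()
```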

[Day 206] Finishing the Stock Market Analysis zoomcamp (for now)

Hello :) Today is Day 206! A quick summary of today: the last homework from the Stock Market Analysis Zoomcamp and uploading neo4j scripts to the KB project repo. Over the past 4 weeks, my lab mate (Jae-Hyeok, the same guy with whom I am doing the KB project) and I have been covering another course by DataTalksClub - the Stock Market Analysis Zoomcamp . The course introduces basic concepts and strategies for creating models that can potentially invest in stocks. Today was the 4th week since we started it and the last homework, which involved working with a financial model workbook to modify and analyse stocks, focusing on Random Forest tuning, reducing feature sets (to see the difference in results), predicting strong future growth (investing in cases where the stock growth is above a fixed threshold), and developing an ideal trading strategy. Above is an example resulting graph from one of the questions. We compare models based on CAGR (compound annual growth rate), and there are various models - rando
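For anyone unfamiliar with the comparison metric, a minimal sketch of CAGR (the dollar figures below are made up, not homework results):

```python
# Compound annual growth rate: the constant yearly growth that turns
# start_value into end_value over `years` years.
def cagr(start_value: float, end_value: float, years: float) -> float:
    return (end_value / start_value) ** (1 / years) - 1

# e.g. a strategy that grows $1,000 into $1,500 over 3 years
print(f"{cagr(1_000, 1_500, 3):.2%}")  # ~14.47%
```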

[Day 205] Going back to a basic mlflow service and another meeting for the KB project

Hello :) Today is Day 205! A quick summary of today: no minIO, just plain old postgres and local storage for mlflow, and helping my friend better understand GitHub's workflow. Yesterday I mentioned I set up minIO as an artifact store for mlflow ~ Well, today I removed it haha. The problem was that it wanted some extra AWS credentials in order to work, and to get those I need a working AWS account. I can easily set that up again (I deleted my aws account recently), but my project collaborator would need to do it as well. We have enough work as it is, and this extra tech is not essential, so I just removed it and set up mlflow using a local postgres docker service and a folder as the artifact store. Maybe later we can swap this whole mlflow setup to run on GCP, but it is not crucial if we leave it as is for now. While setting up mlflow and testing the connection, I was reminded of the value of "network" in docker-compose. I had the other servi
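The connection test itself is just a couple of lines; a sketch, assuming the mlflow server from docker-compose is exposed on port 5000 (port and run name are assumptions):

```python
# Quick check that the mlflow tracking server in docker-compose is reachable.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")  # server exposed by compose

client = MlflowClient()
print(client.search_experiments())                # should at least list the default experiment

with mlflow.start_run(run_name="connection-test"):
    mlflow.log_metric("ping", 1.0)
```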

[Day 204] Transaction data EDA + MLflow & minIO docker setup

Hello :) Today is Day 204! A quick summary of today: some EDA on the KB AI competition data, and setting up mlflow and minIO. Firstly, about doing some basic cleaning and EDA on my part of the data for the Kukmin Bank project. These are the variables assigned to myself: trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state. No missing/null values. Box plot of log(amount) in Not Fraud vs Fraud; distribution of Not Fraud vs Fraud; other graphs. Interesting ~ only Fraud transactions in the state of Delaware (of course this is not real data, but interesting nonetheless). We should also do some basic transformation and cleanup before the raw data goes to the db. In my columns, an example case is: all merchant names start with fraud_, so removing it would be fine, just for a bit more clarity. On another note ~ mlflow and minIO: I found this website (in Korean but can be translated) that provides an easy "plug-n-play" Dockerfile and docker-compose seriv
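A short sketch of the cleanup and box-plot step described above; the column names follow the Kaggle csv, and the file path is a placeholder:

```python
# Strip the "fraud_" merchant prefix and plot log(amount) for Not Fraud vs Fraud.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/fraudTrain.csv")

# merchant names all carry a "fraud_" prefix in the raw data - strip it for clarity
df["merchant"] = df["merchant"].str.removeprefix("fraud_")

df["log_amt"] = np.log1p(df["amt"])
df.boxplot(column="log_amt", by="is_fraud")
plt.suptitle("")
plt.title("log(amount): Not Fraud (0) vs Fraud (1)")
plt.show()
```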

[Day 203] Starting LLM zoomcamp module 4 - Monitoring

Hello :) Today is Day 203! A quick summary of today: meeting with my lab mate for the KB AI competition project and starting LLM zoomcamp module 4. Today we met to talk about neo4j and doing EDA for the Kukmin Bank AI competition. We talked about the benefits of docker and docker-compose, and how to use neo4j. The next step we will do is EDA. This might sound a bit odd, given that yesterday I created a GNN, but there I just used raw data without many features; the reason being I wanted to make sure I could get even a simple GNN to work. My lab mate suggested splitting the columns of the dataset in half, to see what we might need to change in the dataset. In total there are 22 columns, so we split it down the middle. These are the columns assigned to myself: trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud. On another note, today I covered about 90% of LLM zo