Posts

Showing posts from 2024

Continuing in a new blog https://ivanstudyblog.github.io/

Hello :) After 220 days of posting here, I am moving my blog to https://ivanstudyblog.github.io/. Please head over there for the latest posts. The UI there is a bit more flexible and customisable, so I will continue my learning journey there. That is all for today :) See you in the new blog.

[Day 220] Chapter 2 The Data Engineering Lifecycle

 Hello :) Today is Day 220! A quick summary of today: read chapter 2 from the book 'Fundamentals of Data Engineering', and transferred more posts onto the new blog UI. What Is the Data Engineering Lifecycle? The data engineering lifecycle begins with getting data from source systems and storing it. Next, we transform the data and then proceed to our central goal, serving data to analysts, data scientists, ML engineers, and others. In reality, storage occurs throughout the lifecycle as data flows from beginning to end. There are five stages: generation, storage, ingestion, transformation, and serving data. Generation: Source Systems. Sources produce data consumed by downstream systems, including human-generated spreadsheets, IoT sensors, and web and mobile applications. Each source has its unique volume and cadence of data generation. A data engineer should know how the source generates data, including relevant quirks or nuances. Data engineers also need to understand the limits of the source systems they

[Day 219] Fundamentals of Data Eng and LLM data preprocessing pipelines in Mage

 Hello :) Today is Day 219! A quick summary of today: started 'Fundamentals of Data Engineering', and covered module 5 (orchestration) of the LLM zoomcamp. When I woke up today I saw that DeepLearning.AI is launching a new course - a DE Professional Certificate - at the end of August. The instructor, Joe Reis, is one of the writers of the famous 'holy book' of DE - Fundamentals of Data Engineering. Thankfully, I found the book officially published for free by Redpanda. Below is a summary of what I read today (Chapter 1). What is Data Engineering? There are many definitions of the term, but all share a similar idea. The book combines them in this one: Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data arch

[Day 218] ML canvas for the KB fraud transaction detection project

 Hello :) Today is Day 218! A quick summary of today: created an ML canvas for the KB project, and rewatched LLM zoomcamp module 4 - monitoring. Just like with the fraud insurance claims project, I created an ML canvas. Finally, I can share the repo as it is public. Here is the link. The ML canvas took most of my time today, along with small checks to make sure that the repo can be easily reproduced using the current code and instructions. Today I also rewatched the 4th module on monitoring for LLMs from the LLM zoomcamp. How come? I saw that the 5th module is finished, and before I cover it I wanted to check everything from the 4th module again. The 5th module covers orchestration using my good ol' friend Mage.  That is all for today! See you tomorrow :)

[Day 217] KB project meeting and reading bank telemarketing papers

 Hello :) Today is Day 217! A quick summary of today: talked about the KB project, and read papers related to bank telemarketing models. Today my KB project partner and I met to catch up. We looked over what he had done during the past week. In addition, we talked about what happens after developing a model in a jupyter notebook - creating a script that takes a new row of features, preprocesses them, and then puts them through the model to get a prediction for that new case. Then we went over the below graph, giving him an overview of what each technology does. He is going on vacation from tomorrow to Friday, so the task left is to write up the project ppt that is needed for the submission in Korean. In addition, for reproducibility - I noticed that the kafka-producer uses local data to send transactions. For better reproducibility, I uploaded the dataset to HuggingFace, and in the code I now read the csv from the HF url (a sketch of this is below). I also read some papers related to bank term deposit te
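A minimal sketch of what that producer change could look like, assuming a hypothetical HuggingFace dataset path, file name, and Kafka topic (none of which are shown in this excerpt):

```python
# Sketch: read the transactions CSV straight from a HuggingFace URL and stream
# rows to Kafka. The dataset path, file name, and topic are placeholders.
import json
import pandas as pd
from kafka import KafkaProducer

HF_CSV_URL = (
    "https://huggingface.co/datasets/<user>/<dataset>/resolve/main/fraudTest.csv"
)

# pandas can read a CSV directly from a public URL, so no local copy is needed
transactions = pd.read_csv(HF_CSV_URL)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # default=str keeps numpy scalar types from breaking JSON serialization
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

for _, row in transactions.iterrows():
    producer.send("transactions", row.to_dict())

producer.flush()
```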

[Day 216] Pipelines for XGBoost and CatBoost training, and using the models in the real-time inference pipeline

 Hello :) Today is Day 216! A quick summary of today: created pipelines to train XGBoost and CatBoost models, added the XGBoost and CatBoost models' predictions to the real-time inference pipeline, updated the model dictionary UI, and created the project README. The project submission deadline is the 11th of Aug, and after that we will make the repo public, but until then I just have to share pictures from the project. All pipelines including today's: Setting up the model training pipes was easy; the tricky part was using the models. The models require OHE data (or dummy vars), so the approach I ended up with was: from the development notebook, I took all the columns used for the model training data to emulate an OHE dataset, then I created a dict with key: col, value: 0. When a message gets processed, the code goes over the data and updates the 0 to a 1 if a particular value is there. It is easier to understand with an example - if the transaction has category: entertainment, then in the
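A minimal sketch of that OHE-dict idea with made-up column names (the real training columns live in the development notebook, which this excerpt does not show):

```python
# Sketch: emulate the one-hot-encoded training columns for a single incoming
# message. Column names here are illustrative placeholders.
TRAINING_COLUMNS = [
    "amt",
    "category_entertainment",
    "category_grocery_pos",
    "category_travel",
    # ... one column per dummy variable used at training time
]

def build_feature_row(message: dict) -> dict:
    # start with every training column set to 0
    features = {col: 0 for col in TRAINING_COLUMNS}
    # numerical features are copied over directly
    features["amt"] = message.get("amt", 0)
    # categorical values flip the matching dummy column to 1,
    # e.g. category == "entertainment" -> category_entertainment = 1
    dummy_col = f"category_{message.get('category', '')}"
    if dummy_col in features:
        features[dummy_col] = 1
    return features

# example: a transaction with category "entertainment"
row = build_feature_row({"amt": 42.5, "category": "entertainment"})
```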

[Day 215] Trying out 'traditional' models on the KB project transaction fraud data

 Hello :) Today is Day 215! A quick summary of today: tried out different models to classify non-fraud/fraud transactions. In addition to the Graph Convolutional Network model, I wanted to create some 'traditional' (non-neural net) models for better comparison and judgement. The features I used for the below models are - numerical: amt (amount); categorical: category, merchant, city, state, job, trans_hour, trans_dow. Then I downsampled the majority class (non-fraud cases; just like for the GCN model). A sketch of this setup is below. Logistic Regression Best params: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'} CatBoost Best params: {'depth': 6, 'iterations': 300, 'learning_rate': 0.3} XGBoost Best params: {'learning_rate': 0.3, 'max_depth': 6, 'n_estimators': 300, 'subsample': 0.9} RandomForest Best params: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200} LGBMClassifier Bes
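A minimal sketch of the kind of setup that could produce those best params, shown for logistic regression only; the file path, parameter grid, and scoring choice are assumptions, not the notebook's actual code:

```python
# Sketch: downsample the majority (non-fraud) class, one-hot encode the
# categorical features, and grid-search a logistic regression.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("fraudTrain.csv")  # placeholder path
# trans_hour / trans_dow are assumed to be already derived from the timestamp

# downsample non-fraud rows to match the number of fraud rows
fraud = df[df["is_fraud"] == 1]
non_fraud = df[df["is_fraud"] == 0].sample(len(fraud), random_state=42)
balanced = pd.concat([fraud, non_fraud])

categorical = ["category", "merchant", "city", "state", "job", "trans_hour", "trans_dow"]
X = pd.get_dummies(balanced[["amt"] + categorical], columns=categorical)
y = balanced["is_fraud"]

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```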

[Day 214] The evaluation of my MLOps zoomcamp project arrived - max points

 Hello :) Today is Day 214! A quick summary of today: documented the KB project data using arrows.app, small repo changes, and an ML zoomcamp announcement. I tried to find a graph documentation tool for the KB project. The most popular I could find is the arrows.app application, where we can create sample nodes with edges between them and add attributes with descriptions. I created the below pic using that app. Also I made some small changes to the repo: to the readme I added: Docker - used to containerise and self-host the below services; Mlflow - used for easy model comparison during development, can be improved by using cloud services (like GCP) for hosting, database and artifact store; Mage - used for pipeline orchestration of the model training and real-time inference pipelines, can be improved by hosting Mage on the cloud; Neo4j - used as a graph database to store transaction data as nodes and edges, can be improved by using Neo4j's AuraDB (hosted on the cloud); Kafka - used to ensure real-time

[Day 213] Creating a grafana dashboard for the KB project

 Hello :) Today is Day 213! A quick summary of today: used Cypher to create a grafana dashboard for my neo4j database. After setting up neo4j's plugin in grafana a few days ago, today I decided to create a live dashboard in grafana. The result: Each graph requires using Cypher (neo4j's query language). One cool graph I was surprised I could make is the geography map, which uses the long and lat of merchants where there was fraud. What else is cool is that the graph updates as new transactions come in. It can update at different intervals, but I tested the 5 second one, and as new data comes in from kafka and is sent to the db, the dashboard reflects it. What is more, this time I saved the dashboard as a json and uploaded it to github. As seen from the next pic, besides the graph, the other things I did today are a bit trivial - just updating variable names, cleaning up data preprocessing, and making the Mage pipelines better.  The deadline for the project is the 11th of Aug, so till then the
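A minimal sketch of the kind of Cypher query a fraud geography panel could be built on, run here through the Python neo4j driver; the node labels, relationship type, property names, and credentials are assumptions, not the project's actual schema:

```python
# Sketch: pull lat/long of merchants involved in fraudulent transactions,
# the kind of result a grafana geomap panel would plot.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# labels, relationship type, and property names below are assumed
FRAUD_MERCHANT_LOCATIONS = """
MATCH (:Transaction {is_fraud: 1})-[:PAID_TO]->(m:Merchant)
RETURN m.name AS merchant, m.merch_lat AS lat, m.merch_long AS long
"""

with driver.session() as session:
    for record in session.run(FRAUD_MERCHANT_LOCATIONS):
        print(record["merchant"], record["lat"], record["long"])

driver.close()
```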

[Day 212] Final Glaswegian TTS model

 Hello :) Today is Day 212! A quick summary of today: created a simple Glaswegian assistant app. After fine-tuning whisper to get the final version of the Glaswegian ASR model, the next task was to do a final fine-tuning of the T5 speech model to get a final version of the Glaswegian TTS model. Well, I did that today. Here is a link to the model on HuggingFace. And its training results: Now that we have the final 2 hour dataset, I was hoping for better results. Before, the generated audio (while with a little accent) sounded robotic. The first thing I had to do was fix the HuggingFace space where the previous version of the Glaswegian TTS was running. The issue was related to voice embeddings, and after a quick fix ~ it was up again, and I loaded the latest glaswegian_tts model. Well now, it *does* sound better. There are cases where it is robotic, but there is definitely improvement compared to the previous version. That previous version was trained on around 30 mins of audio, compared to
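Since the space fix revolved around voice embeddings, here is a minimal sketch of how SpeechT5-style TTS inference with a speaker embedding typically looks; the fine-tuned model id is a placeholder and the x-vector dataset is the usual HuggingFace example, not necessarily what the space uses:

```python
# Sketch: SpeechT5 TTS inference with a speaker (voice) embedding.
# The fine-tuned model id is an illustrative placeholder.
import torch
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("<user>/glaswegian_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("<user>/glaswegian_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# SpeechT5 needs a 512-dim x-vector speaker embedding to define the voice
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

inputs = processor(text="hello from glasgow", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
```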

[Day 211] 2 hour mark !!! Glaswegian dataset goal - accomplished! + whisper-small fine-tuned

 Hello :) Today is Day 211! A quick summary of today: fine-tuned the final whisper-small model for Glaswegian automatic speech recognition (ASR). After yesterday, the audio was at 118 minutes and I was not happy that we did not *officially* hit the 2 hour mark, so today I went on youtube and got a 3min clip from Limmy which I transcribed and added to our dataset. Here is a link to the final 2 hour Glaswegian dataset on HuggingFace. After a few months of working on this side project we hit the set goal!~ After I got the dataset I started fine-tuning whisper-small. It took around 4 hours ~ and the fine-tuned model is ready and deployed for use on HuggingFace Spaces. Looking at the metrics, training could be improved. As for what is next - training a Text-To-Speech model using our final dataset. This was the hard part before, because the voice at the end was, while understandable, a bit robotic. So this is the next step and hopefully we get better results now.  That is all for today! See y
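For context, a minimal sketch of how a fine-tuned whisper-small checkpoint can be loaded for transcription with the transformers pipeline; the model id and audio file name are placeholders for wherever the checkpoint and clips actually live:

```python
# Sketch: load a fine-tuned whisper-small checkpoint and transcribe a clip.
# The model id and file name are placeholders.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="<user>/whisper-small-glaswegian",
    chunk_length_s=30,  # whisper processes audio in 30-second windows
)

result = asr("limmy_clip.wav")
print(result["text"])
```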

[Day 210] 118 minutes of Glaswegian accent audio clips

 Hello :) Today is Day 210! A quick summary of today: did the final audio clip preprocessing to reach our audio dataset mark. Final dataset for the glaswegian voice assistant AI (link to HuggingFace). Today I preprocessed the final audios from 2 of Limmy's youtube videos (Limmy accidentally kills the city and The writer of Saw called Limmy a ...). Just an update on how the process goes now ~ Since our transcription AI is pretty good (according to my Glaswegian-speaking project partner), we pass the full raw audio to our fine-tuned whisper model hosted on HuggingFace spaces. Then the transcript is put into a docs file (where first I check over it for obvious mistakes and flag anything odd that I cannot understand from re-listening to the audio) and split into sensible (small) bits while listening to the audio, like: (this is the start from Limmy accidentally kills the city) Then using an audio tool, I cut the full audio length into clips according to the cut text, then I match c
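The cutting itself is described as being done with an audio tool, but for illustration here is a minimal sketch of the equivalent step in code, assuming pydub and hand-noted start/end times; all file names, timings, and text are placeholders:

```python
# Sketch: cut a full-length audio file into clips from a list of
# (start_ms, end_ms, transcript) segments. Everything here is a placeholder.
from pydub import AudioSegment

full_audio = AudioSegment.from_file("limmy_full_episode.wav")

# timings noted while listening along to the audio against the cut text
segments = [
    (0, 4500, "first sensible small bit of the transcript"),
    (4500, 9200, "second bit"),
]

for i, (start_ms, end_ms, text) in enumerate(segments):
    clip = full_audio[start_ms:end_ms]   # pydub slices by milliseconds
    clip.export(f"clip_{i:03d}.wav", format="wav")
    # the matching transcript text would be written alongside the clip
```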

[Day 209] Using Mage for pipeline orchestration in the KB project

 Hello :) Today is Day 209! A quick summary of today: created Mage pipelines for the KB AI competition project. The repo after today: This morning/afternoon I went on a bit of a roll ~ I set up all the above pipelines in Mage. Below I will go over each one. get_kaggle_data - just one block that downloads the data for the project from Kaggle using the kaggle python package. load_batch_into_neo4j - gets the loaded data from the get_kaggle_data pipeline and inserts it into neo4j. This is the fraudTrain.csv from the Kaggle website (because the fraudTest.csv will be used for the pseudo-streaming pipeline). train_gcn - I tried to split this into more blocks, but with the way I structured the code at the moment, the most practical solution was to do it all at once. That is - create a torch-geometric dataset, create node and edge indices, train the model, test it, and save summary info. At the moment, because everything is local, I am using mlflow just for easy comparison, but later (after the project s
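A minimal sketch of what a Mage data loader block like get_kaggle_data could look like, assuming the kaggle python package and a placeholder Kaggle dataset id; the real block's code is not shown in this excerpt:

```python
# Sketch: a Mage data loader block that pulls the fraud dataset from Kaggle
# and returns it as a DataFrame. The dataset id and file name are placeholders.
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs):
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files("<owner>/<fraud-dataset>", path="data", unzip=True)
    return pd.read_csv("data/fraudTrain.csv")
```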