Posts

Showing posts from August, 2024

Continuing in a new blog https://ivanstudyblog.github.io/

Hello :)  After 220 days of posting here, I am moving my blog to https://ivanstudyblog.github.io/ Please head over there for the latest days. The UI there is a bit more flexible and customisable, so I will continue my learning journey there.  That is all for today :)  See you in the new blog.

[Day 220] Chapter 2 The Data Engineering Lifecycle

Hello :) Today is Day 220! A quick summary of today: read chapter 2 of the book 'Fundamentals of Data Engineering', and transferred more posts onto the new blog UI. What Is the Data Engineering Lifecycle? The data engineering lifecycle begins by getting data from source systems and storing it. Next, we transform the data and then proceed to our central goal, serving data to analysts, data scientists, ML engineers, and others. In reality, storage occurs throughout the lifecycle as data flows from beginning to end. There are 5 stages: Generation, Storage, Ingestion, Transformation, Serving data. Generation: Source Systems. Sources produce data consumed by downstream systems, including human-generated spreadsheets, IoT sensors, and web and mobile applications. Each source has its unique volume and cadence of data generation. A data engineer should know how the source generates data, including relevant quirks or nuances. Data engineers also need to understand the limits of the source systems they

[Day 219] Fundamentals of Data Eng and LLM data preprocessing pipelines in Mage

Hello :) Today is Day 219! A quick summary of today: started 'Fundamentals of Data Engineering', and covered module 5 (orchestration) of the LLM zoomcamp. When I woke up today I saw that DeepLearning.AI is launching a new course - a DE Professional Certificate - at the end of August. The instructor, Joe Reis, is one of the writers of the famous 'holy book' of DE - Fundamentals of Data Engineering. Thankfully, I found the book officially published for free by Redpanda. Below is a summary of what I read today (Chapter 1). What is Data Engineering? There are a lot of definitions of the term, but all share a similar idea. The book combines them all into this one: Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data arch

[Day 218] ML canvas for the KB fraud transaction detection project

Hello :) Today is Day 218! A quick summary of today: creating an ML canvas for the KB project, and rewatching LLM zoomcamp module 4 - monitoring. Just like with the fraud insurance claims project, I created an ML canvas. Finally, I can share the repo as it is public. Here is the link. The ML canvas took most of my time today, along with some small checks to make sure that the repo can be easily reproduced using the current code and instructions. Today I also rewatched the 4th module on monitoring for LLMs from the LLM zoomcamp. How come? I saw that the 5th module is finished, and before I cover it I wanted to check everything from the 4th module again. The 5th module covers orchestration using my good ol' friend Mage.  That is all for today! See you tomorrow :)

[Day 217] KB project meeting and reading bank telemarketing papers

Hello :) Today is Day 217! A quick summary of today: talking about the KB project, and reading papers related to bank telemarketing models. Today my KB project partner and I met to catch up. We looked over what he had done during the past week. In addition, we talked about what happens after developing a model in a jupyter notebook - creating a script that takes a new row of features, preprocesses them, and then puts them through the model to get a prediction for that new case. Then we went over the below graph, giving him an overview of what each technology does. He is going on vacation from tomorrow to Friday, so the tasks left are to write up the project ppt that is needed for the submission in Korean. In addition, for reproducibility - I noticed that the kafka-producer uses local data to send transactions. For better reproducibility, I uploaded the dataset to HuggingFace, and in the code I am now reading the csv from the HF url. I also read some papers related to bank term deposit te
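As a rough illustration of that notebook-to-script step, here is a minimal sketch: the column names, model, and preprocessing are made up for the example (not the project's actual ones), and the HF url in the comment is a placeholder - pandas can read a csv straight from such a url with pd.read_csv(url).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data standing in for the fraud transactions dataset;
# in the project, pandas could read it straight from the HF url, e.g.
# df = pd.read_csv("https://huggingface.co/datasets/<user>/<repo>/resolve/main/<file>.csv")
train = pd.DataFrame({
    "amt": [12.5, 300.0, 7.2, 999.9],
    "category": ["grocery", "travel", "grocery", "entertainment"],
    "is_fraud": [0, 1, 0, 1],
})

# One pipeline object holds both the preprocessing and the model,
# so a new raw row can be scored with a single call
pipeline = Pipeline([
    ("prep", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["category"])],
        remainder="passthrough",
    )),
    ("model", RandomForestClassifier(random_state=42)),
])
pipeline.fit(train[["amt", "category"]], train["is_fraud"])

# A new incoming transaction: preprocessing and prediction happen in one step
new_row = pd.DataFrame([{"amt": 250.0, "category": "travel"}])
print(pipeline.predict(new_row))
```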

[Day 216] Pipelines for XGBoost and CatBoost training, and using the models in the real-time inference pipeline

Hello :) Today is Day 216! A quick summary of today: created pipelines to train XGBoost and CatBoost models, added the XGBoost and CatBoost models' predictions to the real-time inference pipeline, updated the model dictionary UI, and created the project README. The project submission deadline is the 11th of Aug, and after that we will make the repo public, but until then I can only share pictures from the project.  All pipelines, including today's: Setting up the model training pipelines was easy; the tricky part was using the models. Because the models require OHE data (or dummy vars), the approach I ended up with was: from the development notebook, I took all the columns that are used for the model training data to emulate an OHE dataset, then I created a dict with key: col, value: 0. When a message gets processed, the code goes over the data and updates the 0 to a 1 if a particular value is there. It is easier to understand with an example - if the transaction has category: entertainment, then in the
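A minimal sketch of that dummy-variable trick, assuming hypothetical column and feature names (in the project the real list of training columns comes from the development notebook):

```python
# Column names here are hypothetical stand-ins for the columns of the
# one-hot encoded training data taken from the development notebook.
TRAINING_COLUMNS = [
    "amt",
    "category_entertainment",
    "category_grocery_pos",
    "category_travel",
    "state_CA",
    "state_NY",
]

def build_feature_row(transaction: dict) -> dict:
    """Emulate a one-hot encoded row for a single incoming transaction."""
    # Start with every training column set to 0.
    row = {col: 0 for col in TRAINING_COLUMNS}

    # Numerical features are copied over directly.
    row["amt"] = transaction.get("amt", 0.0)

    # For each categorical feature, flip the matching dummy column to 1,
    # e.g. category "entertainment" -> "category_entertainment" = 1.
    for feature in ("category", "state"):
        dummy_col = f"{feature}_{transaction.get(feature, '')}"
        if dummy_col in row:
            row[dummy_col] = 1

    return row

# Example: an incoming message with category "entertainment"
message = {"amt": 42.50, "category": "entertainment", "state": "CA"}
print(build_feature_row(message))
# The resulting dict can be turned into a single-row DataFrame and passed
# to the trained models' predict().
```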

[Day 215] Trying out 'traditional' models on the KB project transaction fraud data

Hello :) Today is Day 215! A quick summary of today: trying out different models to classify non-fraud/fraud transactions. In addition to the Graph Convolutional Network model, I wanted to create some 'traditional' (non-neural net) models for better comparison and judgement.  The features I used for the below models are - numerical: amt (amount); categorical: category, merchant, city, state, job, trans_hour, trans_dow. Then I downsampled the majority class (non-fraud cases; just like for the GCN model). Logistic Regression - Best params: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}; CatBoost - Best params: {'depth': 6, 'iterations': 300, 'learning_rate': 0.3}; XGBoost - Best params: {'learning_rate': 0.3, 'max_depth': 6, 'n_estimators': 300, 'subsample': 0.9}; RandomForest - Best params: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}; LGBMClassifier - Bes
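For context, here is a minimal sketch of the kind of grid search that could produce 'best params' like the ones above, shown for the logistic regression case. The parameter grid, scoring metric, and CV setup are assumptions for illustration (and the data is synthetic), not the project's actual search:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the project this would be the downsampled
# (balanced) transaction features and fraud labels described above.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.5, 0.5], random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],  # liblinear supports both l1 and l2 penalties
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # a reasonable choice for fraud classification
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
# e.g. {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
```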

[Day 214] The evaluation of my MLOps zoomcamp project arrived - max points

Hello :) Today is Day 214! A quick summary of today: documenting the KB project data using arrows.app, small repo changes, and an ML zoomcamp announcement. I tried to find a graph documentation tool for the KB project. The most popular I could find is the arrows.app application, where we can create sample nodes with edges between them and add attributes with descriptions. I created the below pic using that app. Also, I made some small changes to the repo. In the readme I added: Docker - used to containerise and self-host the below services. MLflow - used for easy model comparison during development. Can be improved by using cloud services (like GCP) for hosting, database and artifact store. Mage - used for pipeline orchestration of the model training and real-time inference pipelines. Can be improved by hosting Mage on the cloud. Neo4j - used as a graph database to store transaction data as nodes and edges. Can be improved by using Neo4j's AuraDB (hosted on the cloud). Kafka - used to ensure real-time

[Day 213] Creating a grafana dashboard for the KB project

Hello :) Today is Day 213! A quick summary of today: using Cypher to create a grafana dashboard for my neo4j database. After setting up neo4j's plugin in grafana a few days ago, today I decided to create a live dashboard in grafana. The result: Each graph requires using Cypher (neo4j's query language). One cool graph I was surprised I could make is the geography map, which uses the long and lat of merchants where there was fraud. What else is cool is that the graph updates as new transactions come in. It can update at different intervals, but I tested the 5-second one, and as new data comes in from kafka and is sent to the db, the dashboard reflects it. What is more, this time I saved the dashboard as a json and uploaded it to github. As seen from the next pic, besides the graph, the other things I did today are a bit trivial - just updating variable names, cleaning up data preprocessing and making the Mage pipelines better.  The deadline for the project is the 11th of Aug, so till then the
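To give a flavour of the Cypher behind a geomap panel like that, here is a minimal sketch run through the official neo4j Python driver. The node labels, relationship type, property names, and connection details are assumptions, since the project's actual graph schema isn't shown here - in Grafana, the neo4j plugin would run just the Cypher string itself.

```python
from neo4j import GraphDatabase

# Connection details are placeholders for a locally hosted neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: (:Transaction)-[:PAID_TO]->(:Merchant) with lat/long
# stored on the merchant node and an is_fraud flag on the transaction.
FRAUD_MERCHANT_LOCATIONS = """
MATCH (t:Transaction {is_fraud: 1})-[:PAID_TO]->(m:Merchant)
RETURN m.name AS merchant, m.lat AS latitude, m.long AS longitude
"""

with driver.session() as session:
    for record in session.run(FRAUD_MERCHANT_LOCATIONS):
        print(record["merchant"], record["latitude"], record["longitude"])

driver.close()
```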