Posts

[Day 199] Continuing with Build an LLM from scratch

Hello :) Today is Day 199! A quick summary of today: I saw that all the chapters of the book Build an LLM from scratch have been published, so I decided to continue with it (after a few months of waiting). I like that even though we are in chapter 5 (out of 7), the author still reminds us learners of the process of going from input text to LLM-generated text, making sure we are still on the same page. The goal of this chapter is to train a model, because at the moment, when we try to generate some text, we get gibberish. The untrained model is given "Every effort moves you", and it continues this with "rentingetic wasnم refres RexMeCHicular stren". By the way, the current (untrained) model has the following config:  How can the model learn? Its weights need to be updated so they start to predict the target tokens. Here comes good ol' backpropagation. And it requires a loss function which calculates the difference between desired and actual output (i.e. how ...
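To make the idea of a loss on target tokens concrete, here is a minimal sketch in plain Python. It assumes cross-entropy, a common choice for next-token prediction; the excerpt above does not name the loss, and the 4-token vocabulary, logits, and function names are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Average negative log-probability assigned to the target token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy setup: vocabulary of 4 tokens, 2 positions to predict.
targets = [0, 2]
untrained_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # no preference
trained_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]    # favours the targets

loss_untrained = cross_entropy(untrained_logits, targets)
loss_trained = cross_entropy(trained_logits, targets)
print(loss_untrained, loss_trained)
```

With uniform logits the loss equals log(vocabulary size); as the weights shift probability onto the target tokens, the loss falls towards zero, which is the signal backpropagation uses.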

[Day 198] Transactions Data Streaming Pipeline Project [v1 completed]

Hello :) Today is Day 198! A quick summary of today: data streaming pipeline project [v1 done]. Here is a link to the project's repo. Well ... I did not know I could do it in a day (~14 hours) after yesterday's issues, but here we are. It turns out that in order to insert the full data (~70 variables with nested/list structure), I need the proper PySpark schema. Yesterday I did not have that, and that is why, when I was reading data in the Kafka producer, I was getting NULL in the columns: my schema was wrong. Today I not only fixed the schema for the 4 variables I had yesterday, but also included *all* the variables that come from the Stripe API, ~70 (for completeness). When I run docker-compose, the data streams and is inserted into the Postgres db (and is still running). Unfortunately, the free Stripe API for creating realistic transactions has a limit of 25, so every 3 seconds, 25 new transactions are sent to the db. It has been running half the day (since I got that set up) and as I am w...

[Day 197] Learning about kafka

Hello :) Today is Day 197! A quick summary of today: I finally decided to learn about Kafka and streamed data. I started the day by watching and replicating this video on Building a Real-Time Data Streaming Pipeline using Kafka, Postgres and Streamlit. Kafka, and streaming in general (data/models), have been in my zone of interest for a while, and this video finally gave me a good starting point. The video helped me set up a Kafka service, a Zookeeper (which manages metadata), a producer (which reads data from a streaming source like a live weather feed and sends it to Kafka), and a consumer (which reads data from Kafka). The whole thing was built around the sentiment of generated sentences. First we generate sentences, send them to Kafka using the producer, read them from Kafka using the consumer, and use a sentence sentiment analyser to get a score -> then upload the sentence + score to Postgres -> then, as each new data point comes in, show it in a UI (Streamlit). I took some pics b...
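The producer -> topic -> consumer -> score -> store flow described above can be sketched without any Kafka setup. This is only an in-process stand-in, not the video's code: `queue.Queue` plays the role of the Kafka topic, a plain list plays the role of the Postgres table, and `toy_sentiment` with its word lists is a made-up scorer used purely for illustration.

```python
import queue

# Hypothetical word lists for a toy sentiment scorer (not a real analyser).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def toy_sentiment(sentence):
    """Score = number of positive words minus number of negative words."""
    words = sentence.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def produce(topic, sentences):
    """Producer: push each generated sentence onto the topic."""
    for sentence in sentences:
        topic.put(sentence)

def consume(topic, sink):
    """Consumer: read every pending message, score it, and store the result."""
    while not topic.empty():
        sentence = topic.get()
        sink.append((sentence, toy_sentiment(sentence)))

topic = queue.Queue()  # stand-in for a Kafka topic
rows = []              # stand-in for the Postgres table
produce(topic, ["I love this great day", "what an awful bad morning"])
consume(topic, rows)
print(rows)
```

In a real setup the producer and consumer run as separate processes and the topic lives in the Kafka broker, so either side can restart without losing messages; the sketch only shows the order in which data moves.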

[Day 196] Learned about 'ML canvas' and more about MLOps

Hello :) Today is Day 196! A quick summary of today: I found this great resource for MLOps. Below I will summarise the posts I read. A better pic of the above is on my repo. I learned about the above 'ML Canvas' concept from the resources below. Motivating MLOps. Why MLOps? Machine learning (ML) models are increasingly being used in production environments, but their development and deployment are often disconnected from traditional software development and operations practices. This disconnection leads to various pain points, such as: Lack of collaboration: Data scientists, engineers, and operators work in silos, leading to inefficiencies and errors. Inconsistent workflows: Ad-hoc processes and manual interventions hinder reproducibility, scalability, and maintainability. Inadequate infrastructure: Insufficient infrastructure and tools lead to difficulties in deploying, monitoring, and updating ML models. Motivation for MLOps: To address these challenges, MLOps aims to b...

[Day 195] Reading about bank term deposit subscription prediction models

Hello :) Today is Day 195! One of my lab mates sent me a few papers to skim through to help with his team's project on predicting bank deposit subscriptions. Thanks to ChatGPT, skimming is very easy now. Below are the outputs from ChatGPT on the five papers I got. Predictive Analytics and Machine Learning in Direct Marketing for Anticipating Bank Term Deposit Subscriptions. Introduction: Direct marketing is essential for personalized client communication in banking. Predictive analytics and machine learning offer new opportunities for refining marketing strategies. The research aims to enhance direct marketing's effectiveness by applying sophisticated analytical models. Literature Review: Examines eight studies on machine learning and data mining in banking. Highlights methodologies like the S_Kohonen network, Improved Whale Optimization Algorithm, META-DES-AAP, and various machine learning models. Emphasizes the importance of time deposits, customer credit products, and ...

[Day 194] Using Video Generation Models for Taxi OD Demand Matrix Prediction

Hello :) Today is Day 194! A quick summary of today: I finished the paper for which I read many papers and posted about them throughout May/June/July. Since May I have been talking about reading research related to predicting the OD demand matrix using either graph neural networks or next-frame (video) prediction models. Well, fast forward to today, and I finished it. Everything is on my GitHub repo. I ran data through 3 models: historical average (HA), ConvLSTM, and PredRNN. HA is the most common baseline, and ConvLSTM and PredRNN are two of the best next-frame prediction models. Here is the abstract of the paper: Predicting taxi demand is essential for managing urban transportation effectively. This study explores the application of next-frame prediction models (ConvLSTM and PredRNN) to forecast Origin-Destination (OD) taxi demand matrices using a concatenated dataset of NYC taxi data from early 2024. ConvLSTM achieved an RMSE of 1.27 with longer training times, while PredRNN achieved 1.59 with faster t...
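As a rough illustration of the historical average baseline and the RMSE metric mentioned above, here is a minimal sketch in plain Python. The 2x2 OD matrices and the function names are made up for illustration; this is not the paper's code or data, and the paper may compute its average over a different window.

```python
import math

def historical_average(history):
    """Forecast each OD cell as the mean of that cell over the past matrices."""
    n = len(history)
    rows, cols = len(history[0]), len(history[0][0])
    return [[sum(m[i][j] for m in history) / n for j in range(cols)]
            for i in range(rows)]

def rmse(pred, actual):
    """Root mean squared error over all origin-destination cells."""
    squared = [(p - a) ** 2
               for pred_row, actual_row in zip(pred, actual)
               for p, a in zip(pred_row, actual_row)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data: two past time steps of a 2x2 OD demand matrix.
history = [
    [[2, 0], [1, 3]],
    [[4, 2], [1, 1]],
]
actual = [[3, 1], [3, 2]]  # demand observed at the next time step

forecast = historical_average(history)
error = rmse(forecast, actual)
print(forecast, error)
```

Because the baseline needs no training, it gives a floor that a learned model such as ConvLSTM or PredRNN should beat on the same RMSE.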

[Day 193] Chapters 5, 6, and 7 from Effective Data Science Infrastructure

Hello :) Today is Day 193! A quick summary of today: covered chapters 5, 6, and 7 from Effective Data Science Infrastructure. Chapter 5: Practicing scalability and performance. Effective infrastructure must accommodate a wide range of projects. Rather than adopting a one-size-fits-all approach, it should offer a versatile toolbox of robust methods to achieve adequate scalability and performance. To enhance organizational scalability and ensure projects are comprehensible to the largest audience, our primary strategy is simplicity. Given that people's understanding is limited, overengineering and overoptimizing can cause extra costs. Vertical scalability refers to the idea of handling more compute and larger datasets just by using larger instances. To start things off, we begin with a skeleton flow, and then keep adding new things until we get to the final solution. The model uses Yelp review data and the goal is to group reviews together to find what kind of reviews are general...