[Day 133] Gathering data for the Scottish dataset project + Factor analysis + Grokking ML + MLxFundamentals Day 4 (2)

 Hello :)
Today is Day 133!


A quick summary of today:
  • started putting together some audio clips + transcription for the Scottish dataset project
  • saw how eigenvalues play a role in factor analysis
  • started reading Grokking Machine Learning on Manning.com
  • finished MLx Fundamentals' final session

Firstly, about the audio data

My Scottish partner for this project has recorded various phrases in Glaswegian in the past and uploaded them to youtube. Today I did 4 of the 10 videos. 

To cut the clips I ended up using an app called VideoPad, and even though it is a paid app, it allows me to just cut an audio clips in smaller pieces and save them as new files. 

This is a sample audio waveform of one of the 4 videos

What I did was, make short clips around each expression. I am not sure what these waves are called when there is speech. So for example from the above, I ended up with 36 clips, and uploaded them all (along with the other 3 videos' audio clips) to our project's drive. And the total amount we have so far is 4.75 minutes.  

Secondly, today I read about eigenvalues' role in factor analysis

In my stats class at uni, we learned about factor analysis, and at the end of the chapter I saw the word eigenvalues, and I am glad because once again I will see their real world impact (after my dive into multicollinearity). 

Firstly, about factor analysis, here are the results after using unstandardized variables

And after standardization

Why standardize?

  • result interpretability
  • helps with linearity
  • treats variables equally
Where are the eigenvalues? The overall under each loading. 1.981 and 1.008. Which are sum of the square of each of the 3 values above it. 

To interpret this, we take the 1st row and F1: 0.00089222 - this is F1 accounts for 0.0089% of the variance in Y1 which is Finance, while F2 accounts for 99.90%. And in total, the 2 factor space accounts for 99.99% of the variance in Y1 Finance. 

 The eigenvalues can help to determine which factors to keep (i.e. using scree plots).

Love it when I see the math I studied for ML being used in practice, and *where* it is used.


Thirdly, about Grokking ML

I decided to subscribe to manning.com and the 1st book I decided to read was Grokking ML (as it is one of the most recommended and popular ones). Today I managed to read the 1st 4 chapters(what is ML, types of ML, linear regression, optimization), and I can definitely see why it is popular for beginners, and I am excited to keep reading. 


Finally, the last session from MLxFundamentals was delivered by Wenhan Han a PhD candidate from TU Eindhoven.

It was about loading and using an LLM, and a diffusion model

Some interesting bits from both parts are:

Question to an LLM:

How many kinds of human beings are there in the history?

We saw the top answers from the model. 

Various prompt strategies

Zero-shot

Few shot

Chain-of-thought

Then, how to finetune an LLM with 'unsloth'

We add LoRA adapters
Prepare the data
Data was from huggingface

Train the model
And then inference

For finetuning I need to try unsloth by myself. For some part of the tutorial, an openai api key was required which was unfortunate because I do not have one, so I just watched that part. 

As for diffusion models


I saw the library 'diffusers' was used which I did not know.

And I played around with it a bit, a picture I tried to get and failed was 'a horse riding an astronaut'


That is all for today!

See you tomorrow :)

Popular posts from this blog

[Day 198] Transactions Data Streaming Pipeline Porject [v1 completed]

[Day 107] Transforming natural language to charts

[Day 54] I became a backprop ninja! (woohoo)