[Day 121] Uncovering the full reason behing multicollinearity + Frequent itemset mining lecture

 Hello :)
Today is Day 121!


A quick summary of today:


Whenever I read books, or see blogs, ask an LLM, what happens when there is multicollinearity - the same generic answer is given: multicollinearity causes the estimated parameters to be unreliable. I always wanted to ask why? and had a ton of questions that follow. Well today I finally 'got my hands dirty' and got into the depths of it. Not only does it feel awesome to uncover the truth, but also I am seeing some of the math concepts that I learned about - eigenvalues/vectors, SVD, matrix's rank, determinant and condition number actually being used. 

I wrote my notes, which I shared on r/learnmachinelearning to get opinions and feedback. But also I put it all in a colab/kaggle notebook.

Here is the gist

Building up to multicollinearity

  • Eigenvalues and Eigenvectors: when an eigenvalue is close to 0, it indicates that the associated eigenvector does not scale much during the applied transformation
  • Minimal scaling: the vector does not scale (much), this means the transformation has very little effect along the direction of that eigenvector (little effect on the features or characteristics of the data that are aligned with that particular eigenvector)
  • Redundancy: if an eigenvector is nearly parallel to the null space of the transformation matrix, this suggests there is a linear combination of the data (features in the design matrix) that has no effect on the outcome
  • Multicollinearity: redundancy (or near-redundancy) in the data is a characteristic of multicollinearity - one predictor variable can be approximately expressed as a linear combination of the others. If we have multicollinearity, there is near linear dependence and the matrix is (near) singular, which makes X^TX ill-conditioned and numerically unstable (sensitive to pertrubations) which inflates the parameter variances, resulting in poor estimates

In the notebook there are some code examples as well showing the eigenvalues, rank, condition number and a logistic regression when THERE IS and there is NOT multicollinearity.

Below are the results from the final linear regression:

  • There is multicollinearity

Multicollinearity case:

eigenvalues=array([  0., 275.])

Rank of the matrix: 1

condition_number=inf


Original Coefficients and Intercept:

Coefficients: [0.80039912 1.60079824]

Intercept: 0.03991352436920792


Perturbed Coefficients and Intercept:

Coefficients: [-8.6065131   6.30414128]

Intercept: 0.03940363891543086

  • There is NO multicollinearity

No multicollinearity case:

eigenvalues=array([  1.9623732, 188.0376268])

Rank of the matrix: 2

condition_number=95.82154225314864


Original Coefficients and Intercept:

Coefficients: [2.05446569 0.92175784]

Intercept: 0.15618973818845738


Perturbed Coefficients and Intercept:

Coefficients: [2.05486836 0.92235525]

Intercept: 0.1525767124126478


The above showcases one of the consequences of multicollinearity - making the parameters sensitive to even small changes. For the full code I recommend visiting the kaggle colab (it is a quick read). 


Secondly, my notes from the lecture on frequent itemset mining

Covered topics:

Frequent itemsets, association rule mining, finding frequent itemsets, finding frequent pairs, A-Priori algorithm, PCY (Park-Chen-Yu) algorithm, frequent itemsets in <= passes


That is all for today!

See you tomorrow :)

Popular posts from this blog

[Day 198] Transactions Data Streaming Pipeline Porject [v1 completed]

[미리 공부] 기초 통계 복습 (Day 1는 1월2일)

[Day 61] Stanford CS224N (NLP with DL): Machine translation, seq2seq + a side CDCGAN mini project