[Day 121] Uncovering the full reason behind multicollinearity + Frequent itemset mining lecture

 Hello :)
Today is Day 121!


A quick summary of today:


Whenever I read books or blogs, or ask an LLM, what happens when there is multicollinearity, the same generic answer is given: multicollinearity causes the estimated parameters to be unreliable. I always wanted to ask why, and had a ton of follow-up questions. Well, today I finally 'got my hands dirty' and got into the depths of it. Not only does it feel awesome to uncover the truth, but I am also seeing some of the math concepts that I learned about - eigenvalues/vectors, SVD, matrix rank, determinant and condition number - actually being used.

I wrote up my notes and shared them on r/learnmachinelearning to get opinions and feedback, and I also put it all in a colab/kaggle notebook.

Here is the gist

Building up to multicollinearity

  • Eigenvalues and Eigenvectors: when an eigenvalue is close to 0, the associated eigenvector barely scales under the applied transformation
  • Minimal scaling: since the vector barely scales, the transformation has very little effect along the direction of that eigenvector (i.e. on the features or characteristics of the data that are aligned with that particular eigenvector)
  • Redundancy: if an eigenvector lies (nearly) in the null space of the transformation matrix, there is a linear combination of the data (features in the design matrix) that has (almost) no effect on the outcome
  • Multicollinearity: redundancy (or near-redundancy) in the data is the hallmark of multicollinearity - one predictor variable can be approximately expressed as a linear combination of the others. With multicollinearity there is near linear dependence, so the design matrix is (near) singular, which makes X^TX ill-conditioned and numerically unstable (sensitive to perturbations); this inflates the parameter variances and results in poor estimates

In the notebook there are code examples as well, showing the eigenvalues, rank, condition number, and a logistic regression when there IS and there is NOT multicollinearity.
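
For illustration, here is a minimal sketch of those diagnostics (my own toy numpy example, not the notebook's exact code) - x2 is deliberately an exact multiple of x1, so X^TX is singular:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1                                  # perfectly collinear with x1
X = np.column_stack([x1, x2])

gram = X.T @ X                               # the X^T X matrix from the notes
eigenvalues = np.linalg.eigvalsh(gram)       # one eigenvalue is (numerically) ~0
rank = np.linalg.matrix_rank(X)              # 1 instead of 2
condition_number = np.linalg.cond(gram)      # huge (or inf) -> ill-conditioned

print(f"{eigenvalues=}")
print(f"Rank of the matrix: {rank}")
print(f"{condition_number=}")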

Below are the results from the final linear regression:

  • There is multicollinearity

Multicollinearity case:

eigenvalues=array([  0., 275.])

Rank of the matrix: 1

condition_number=inf


Original Coefficients and Intercept:

Coefficients: [0.80039912 1.60079824]

Intercept: 0.03991352436920792


Perturbed Coefficients and Intercept:

Coefficients: [-8.6065131   6.30414128]

Intercept: 0.03940363891543086

  • There is NO multicollinearity

No multicollinearity case:

eigenvalues=array([  1.9623732, 188.0376268])

Rank of the matrix: 2

condition_number=95.82154225314864


Original Coefficients and Intercept:

Coefficients: [2.05446569 0.92175784]

Intercept: 0.15618973818845738


Perturbed Coefficients and Intercept:

Coefficients: [2.05486836 0.92235525]

Intercept: 0.1525767124126478


The above showcases one of the consequences of multicollinearity - the parameter estimates become sensitive to even small changes in the data. For the full code I recommend visiting the kaggle/colab notebook (it is a quick read).
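
Roughly, the perturbation experiment can be reproduced with a sketch like the one below (assumed toy data and sklearn, not the notebook's exact code): fit on collinear features, add tiny noise to the features, refit, and compare the coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = 2 * x1                                        # collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=200)

original = LinearRegression().fit(X, y)
perturbed = LinearRegression().fit(X + rng.normal(scale=1e-3, size=X.shape), y)

print("Original coefficients:", original.coef_, "intercept:", original.intercept_)
print("Perturbed coefficients:", perturbed.coef_, "intercept:", perturbed.intercept_)

With the collinear columns the two sets of coefficients can differ wildly even though the predictions barely change; with independent columns they stay essentially identical.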


Secondly, here are my notes from the lecture on frequent itemset mining

Covered topics:

Frequent itemsets, association rule mining, finding frequent itemsets, finding frequent pairs, A-Priori algorithm, PCY (Park-Chen-Yu) algorithm, frequent itemsets in <= passes
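
To make the A-Priori idea concrete, here is a tiny toy sketch (my own example baskets, not lecture code) of its two passes: count single items first, then count only the candidate pairs whose items are both frequent:

from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
support = 3  # minimum number of baskets an itemset must appear in

# Pass 1: count single items and keep the frequent ones.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, c in item_counts.items() if c >= support}

# Pass 2: count only pairs whose items are both frequent
# (A-Priori monotonicity: a pair can only be frequent if both of its items are).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= support}
print(frequent_items)
print(frequent_pairs)

PCY builds on this by also hashing pairs into buckets during the first pass and using the bucket counts to prune the pair candidates even further.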


That is all for today!

See you tomorrow :)
