[Day 121] Uncovering the full reason behind multicollinearity + Frequent itemset mining lecture
Hello :)
Today is Day 121!
A quick summary of today:
- finally uncovered the full story behind multicollinearity
- covered lecture 2 from Stanford's CS246: Mining massive datasets: Frequent itemset mining
Whenever I read a book or a blog, or ask an LLM, what happens when there is multicollinearity, I get the same generic answer: multicollinearity makes the estimated parameters unreliable. I always wanted to ask why, and I had a ton of follow-up questions. Well, today I finally 'got my hands dirty' and dug into the depths of it. Not only does it feel awesome to uncover the truth, but I am also seeing some of the math concepts I learned about (eigenvalues/eigenvectors, SVD, matrix rank, determinant and condition number) actually being used.
I wrote up my notes and shared them on r/learnmachinelearning to get opinions and feedback. I also put everything in a colab/kaggle notebook.
Here is the gist
Building up to multicollinearity
- Eigenvalues and eigenvectors: an eigenvalue tells us how much a transformation scales its eigenvector. When an eigenvalue is close to 0, the associated eigenvector is shrunk almost to nothing by the transformation
- Near-zero scaling: because the vector is squashed to (almost) zero, the design matrix has almost no spread along the direction of that eigenvector, so the data carries very little information about the features or characteristics aligned with it
- Redundancy: if an eigenvector lies (nearly) in the null space of the design matrix X, there is a linear combination of the features that is (nearly) zero for every observation, so moving the coefficients along that direction has (almost) no effect on the fitted outcome
- Multicollinearity: this redundancy (or near-redundancy) in the data is what characterizes multicollinearity - one predictor variable can be approximately expressed as a linear combination of the others. With near linear dependence, X^TX is (nearly) singular, which makes it ill-conditioned and numerically unstable (sensitive to perturbations). The covariance of the least-squares estimates is proportional to (X^TX)^-1, whose eigenvalues are 1/λ, so a near-zero eigenvalue λ inflates the parameter variances, resulting in poor estimates
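The chain above (near-zero eigenvalue, rank deficiency, exploding condition number) can be checked numerically. Below is a minimal sketch using NumPy, not the notebook's code: the helper name `collinearity_diagnostics` and the toy data are assumptions made up for illustration. In the collinear design the second column is exactly twice the first.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (eigenvalues of X^T X, rank of X, condition number of X^T X)."""
    gram = X.T @ X
    # eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues = np.linalg.eigvalsh(gram)
    rank = np.linalg.matrix_rank(X)
    if rank < X.shape[1]:
        # Columns are linearly dependent: X^T X is singular.
        condition_number = np.inf
    else:
        condition_number = eigenvalues[-1] / eigenvalues[0]
    return eigenvalues, rank, condition_number

# Toy data (an assumption for illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_collinear = np.column_stack([x1, 2.0 * x1])  # second feature = 2 * first
X_independent = np.column_stack([x1, np.array([2.0, 1.0, 4.0, 3.0, 6.0])])

for name, X in [("collinear", X_collinear), ("independent", X_independent)]:
    eigenvalues, rank, cond = collinearity_diagnostics(X)
    print(name, "eigenvalues:", eigenvalues, "rank:", rank, "condition number:", cond)
```

For the collinear design, X^TX is [[55, 110], [110, 220]], which has determinant 0 and trace 275, so its exact eigenvalues are 0 and 275 and its rank is 1.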
In the notebook there are also code examples showing the eigenvalues, rank, condition number and a linear regression fit, both when there is and when there is not multicollinearity.
Below are the results from the final linear regression:
- There is multicollinearity
Multicollinearity case:
eigenvalues=array([ 0., 275.])
Rank of the matrix: 1
condition_number=inf
Original Coefficients and Intercept:
Coefficients: [0.80039912 1.60079824]
Intercept: 0.03991352436920792
Perturbed Coefficients and Intercept:
Coefficients: [-8.6065131 6.30414128]
Intercept: 0.03940363891543086
- There is NO multicollinearity
No multicollinearity case:
eigenvalues=array([ 1.9623732, 188.0376268])
Rank of the matrix: 2
condition_number=95.82154225314864
Original Coefficients and Intercept:
Coefficients: [2.05446569 0.92175784]
Intercept: 0.15618973818845738
Perturbed Coefficients and Intercept:
Coefficients: [2.05486836 0.92235525]
Intercept: 0.1525767124126478
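The above showcases one of the consequences of multicollinearity: the parameters become sensitive to even small changes in the data. With multicollinearity, the perturbation moves the coefficients from roughly [0.80, 1.60] to [-8.61, 6.30], while without it they barely change (from [2.054, 0.922] to [2.055, 0.922]). For the full code I recommend visiting the kaggle/colab notebook (it is a quick read).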
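For readers who want to try this kind of perturbation experiment without opening the notebook, here is a rough sketch. It is not the notebook's code: the synthetic data, the noise scales, and the helper names `fit_least_squares` and `coefficient_shift` are all assumptions, and `np.linalg.lstsq` stands in for whatever regression routine the notebook uses, so the numbers will differ from the results above.

```python
import numpy as np

def fit_least_squares(X, y):
    """Least-squares fit of y on X with an intercept; returns (coefficients, intercept)."""
    # Center the data so the intercept can be recovered afterwards.
    X_mean, y_mean = X.mean(axis=0), y.mean()
    # lstsq returns the minimum-norm solution when the columns are linearly dependent.
    coef, *_ = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)
    return coef, y_mean - X_mean @ coef

def coefficient_shift(X, y, noise_scale=0.01, seed=1):
    """Refit after adding small Gaussian noise to X; return how far the coefficients moved."""
    rng = np.random.default_rng(seed)
    coef, _ = fit_least_squares(X, y)
    X_perturbed = X + rng.normal(scale=noise_scale, size=X.shape)
    coef_perturbed, _ = fit_least_squares(X_perturbed, y)
    return float(np.linalg.norm(coef_perturbed - coef))

# Synthetic data (hypothetical, chosen only for illustration).
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, size=n)
noise = rng.normal(scale=0.5, size=n)

X_collinear = np.column_stack([x1, 2.0 * x1])  # x2 = 2 * x1
X_independent = np.column_stack([x1, rng.uniform(0, 10, size=n)])

y_collinear = X_collinear @ np.array([2.0, 1.0]) + noise
y_independent = X_independent @ np.array([2.0, 1.0]) + noise

shift_collinear = coefficient_shift(X_collinear, y_collinear)
shift_independent = coefficient_shift(X_independent, y_independent)
print("coefficient shift with multicollinearity:   ", shift_collinear)
print("coefficient shift without multicollinearity:", shift_independent)
```

The perturbation is tiny compared to the feature range (0.01 versus 0 to 10), so any large shift in the collinear case comes from the ill-conditioning and not from the size of the noise.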
Covered topics:
Frequent itemsets, association rule mining, finding frequent itemsets, finding frequent pairs, A-Priori algorithm, PCY (Park-Chen-Yu) algorithm, frequent itemsets in <= passes
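To make the A-Priori idea concrete, here is a small sketch for frequent pairs in plain Python. It is my own toy version, not code from the lecture: the function name `apriori_frequent_pairs` and the baskets are made up. It relies on the monotonicity property that a pair can only be frequent if both of its items are frequent, so the second pass counts far fewer candidate pairs.

```python
from collections import Counter
from itertools import combinations

def apriori_frequent_pairs(baskets, support):
    """Return {pair: count} for item pairs appearing in at least `support` baskets."""
    # Pass 1: count individual items and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, count in item_counts.items() if count >= support}

    # Pass 2: count only pairs made of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)  # sorted so each pair has one canonical order
        pair_counts.update(combinations(kept, 2))

    return {pair: count for pair, count in pair_counts.items() if count >= support}

# Hypothetical baskets for illustration.
baskets = [
    ["milk", "bread", "beer"],
    ["milk", "bread"],
    ["bread", "beer"],
    ["milk", "bread", "diapers"],
    ["beer", "diapers"],
]

print(apriori_frequent_pairs(baskets, support=3))
# → {('bread', 'milk'): 3}
```

With a support threshold of 3, "diapers" is dropped after the first pass (it appears in only 2 baskets), so no pair containing it is ever counted.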
That is all for today!
See you tomorrow :)