The issue of multicollinearity

What is multicollinearity?

Multicollinearity is a condition in which two or more explanatory variables are strongly correlated with one another, which can lead to unstable coefficient estimates and misleading predictions.

When is multicollinearity an issue?

Multicollinearity becomes a problem when the correlations between the explanatory variables are not stable, that is, when they can change as the underlying conditions change.

For example, consider the stock market before and after the COVID-19 pandemic.

Before the pandemic, the automobile and pharmaceutical sectors were both performing well, so their returns were highly correlated.

If we calibrate a machine learning model on the pre-COVID data and deploy it without checking for multicollinearity, it may initially give accurate results.

But after the pandemic began, the automobile sector saw a decline in its performance while the pharmaceutical sector saw an increase in its performance.

The predictions of our model, which was calibrated before the COVID-19 pandemic using the correlation between the sectors at that time, will now be inaccurate, and we will need to calibrate a new model.

To avoid the multicollinearity trap, we should not include highly correlated explanatory variables, and we should try to satisfy the Gauss-Markov assumption that no regressor is a linear combination of the other regressors.

Steps to detect multicollinearity

We can detect multicollinearity in the following ways:

Correlation matrix
Correlation heatmap
Variance Inflation Factor (VIF)

The first two are illustrated in the sketch below; the VIF is covered in the next section.
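A minimal sketch of the correlation matrix and heatmap, assuming pandas, seaborn, and matplotlib are installed; the DataFrame and its columns are made-up illustration data, not from the stock-market example:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up illustration data: x2 is deliberately close to a multiple of x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

corr = df.corr()        # correlation matrix of the explanatory variables
print(corr.round(2))    # x1 and x2 show a correlation close to 1

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)  # correlation heatmap
plt.show()
```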

Variance Inflation Factor (VIF)

VIF is a metric that measures the amount of multicollinearity contributed by each explanatory variable: we regress that variable on all the other explanatory variables and look at the R squared of the resulting auxiliary model.

Let the model be

Y = β0 + β1 X1 + β2 X2 + … + βk Xk + ε

To compute the VIF of X1, we express X1 as a linear combination of all the other regressors:

X1 = α0 + α2 X2 + … + αk Xk + u

Let R1² be the R squared of this auxiliary regression. The VIF of X1 is then

VIF(X1) = 1 / (1 − R1²)
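To make the formula concrete, here is a small sketch that computes the VIF of X1 by hand via the auxiliary regression, assuming scikit-learn is available; the regressors are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up regressors: x2 is close to a linear function of x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=200)
x3 = rng.normal(size=200)

# Auxiliary regression: regress X1 on the remaining regressors
others = np.column_stack([x2, x3])
r1_squared = LinearRegression().fit(others, x1).score(others, x1)

# VIF of X1 from the R squared of the auxiliary regression
vif_x1 = 1.0 / (1.0 - r1_squared)
print(f"R1 squared = {r1_squared:.3f}, VIF(X1) = {vif_x1:.2f}")
```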

The VIF increases as R1² increases, which signifies that X1 can be efficiently expressed as a linear combination of the other regressors; in other words, X1 is largely explained by the other regressors and can be dropped. For example, R1² = 0.9 gives VIF(X1) = 1 / (1 − 0.9) = 10.

Similarly, we calculate the VIF for every regressor, and regressors with a VIF greater than 5 (a common rule of thumb) should be dropped, as they exhibit high multicollinearity.
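In practice we rarely run each auxiliary regression by hand; statsmodels provides variance_inflation_factor, used here in a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up data: x2 is almost a copy of x1, so both should get a high VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

X = add_constant(df)  # add an intercept so each auxiliary regression has a constant
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const").round(2))  # flag regressors whose VIF exceeds 5
```

The intercept column added by add_constant gets a VIF of its own, but it is not meaningful, so it is dropped from the output.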

Steps for solving Multicollinearity

As discussed above, the most direct remedy is to drop the regressors whose VIF exceeds the chosen threshold (or one variable from each highly correlated pair) and recalibrate the model on the remaining regressors.
