The issue of multicollinearity

What is multicollinearity?

Multicollinearity is defined as a condition where two or more explanatory variables are related amongst themselves which may cause misleading predictions.

When is multicollinearity an issue?

Multicollinearity is an issue when the correlations between the columns may change with change in the conditions.

Steps to detect multicollinearity

We can detect multicollinearity by the following ways

  1. We can look at the correlation matrix or correlation heatmap and look out for high values in the dataset
  2. We can look at the variance inflation factor (discussed later in this article) of each column and look out for values of variance inflation factor greater than 5
Correlation matrix
Correlation heatmap

Variance Inflation Factor (VIF)

VIF is a metric which is used for measuring the amount of multicollinearity in the dataset by regressing an explanatory variable on other explanatory variables and then looking at the R squared of that model

Steps for solving Multicollinearity

  1. Drop variables on the basis of domain knowledge if possible
  2. Look out for variables with high correlation coefficient in the correlation matrix and drop them
  3. Look out for variables with a VIF value greater than 5 and drop them
  4. Use lasso regression as the regularization function in lasso regression automatically drops one variable
  5. Use PCA technique to solve the issue of multicollinearity

--

--

A data science enthusiast currently pursuing a bachelor's degree in data science

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Aayushmaan Jain

A data science enthusiast currently pursuing a bachelor's degree in data science