Correlation

Aayushmaan Jain
3 min read · Jun 17, 2021

Correlation is a measure of the strength and direction of the linear relationship between two variables. The correlation coefficient is used in many statistical analyses and machine learning algorithms.

Need for correlation

Measures computed on a single variable, such as the mean, median, mode and standard deviation, cannot be used to compare two bivariate datasets. Two sets of numbers can share the same summary values and still be related in completely different ways. In addition, order-based measures such as the median require sorting the data, which breaks the pairing between x and y: after sorting, a value of x may end up matched with a different value of y.
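As a small illustration of this point, the sketch below uses made-up numbers in which two datasets contain exactly the same y values and differ only in how they are paired with x. The data and the `summary` helper are assumptions made for this example.

```python
from statistics import mean, median, pstdev

# Hypothetical data: both y columns hold the same values,
# only their pairing with x differs.
x = [1, 2, 3, 4, 5]
y_rising = [10, 20, 30, 40, 50]   # y grows with x
y_falling = [50, 40, 30, 20, 10]  # y shrinks as x grows


def summary(values):
    """Single-column summaries; they ignore how x and y are paired."""
    return (mean(values), median(values), pstdev(values))


# The summaries match even though the two relations are opposite.
print(summary(y_rising))
print(summary(y_falling))
print(summary(y_rising) == summary(y_falling))
```

The single-column summaries cannot tell the two datasets apart, which is why a measure that works on the (x, y) pairs is needed.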

Intuition behind correlation

Let there be two data series named x and y

To analyze the correlation, we can plot y against x as a scatter plot

Then we draw a horizontal line at y=mean(y) and a vertical line at x=mean(x)

Now we can shift the origin to (mean(x), mean(y)), so that each point (x, y) becomes (x-mean(x), y-mean(y)). We then take the product of the abscissa and the ordinate of each point. This product is positive in quadrants 1 and 3 and negative in quadrants 2 and 4. In other words, the product is positive when x and y lie on the same side of their means (y tends to increase with x) and negative when they lie on opposite sides (y tends to decrease as x increases). We then sum this product over all the points and divide the sum by the number of points n, so that the result does not depend on the size of the data series. The quantity obtained is known as covariance, which can be given by the formula

cov(x, y) = (1/n) * Σ (x_i - mean(x)) * (y_i - mean(y))
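The steps above can be sketched in plain Python. This is a minimal illustration that divides by the number of points, as described in the text; the function name `covariance` and the sample data are assumptions made for this example. Note that some libraries divide by n-1 by default (for example NumPy's `np.cov`), so their results can differ slightly.

```python
from statistics import mean


def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    mx, my = mean(x), mean(y)
    # Shift the origin to (mean(x), mean(y)) and multiply the coordinates.
    products = [(xi - mx) * (yi - my) for xi, yi in zip(x, y)]
    # Divide by the number of points to remove the effect of series length.
    return sum(products) / len(products)


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # y increases with x
y_down = [10, 8, 6, 4, 2]  # y decreases as x increases

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative
```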

Now we divide covariance by the product of standard deviations of x and y. We do so because

  1. It cancels the units of each column in the bivariate data series, which makes correlation independent of units so that we can compare any series
  2. It brings the correlation into the range of -1 to 1, which provides a universal scale for comparing bivariate data series
  3. Since the standard deviation is never negative (and is positive unless the series is constant), dividing by it does not change the sign of the covariance

Hence the formula of correlation can be given by

corr(x, y) = cov(x, y) / (std(x) * std(y))
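As a rough sketch, dividing the covariance by the product of the standard deviations could look like this in plain Python. The function name `correlation` and the sample data (hours studied against exam score) are hypothetical and only serve the illustration.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Correlation coefficient: covariance / (std(x) * std(y))."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    mx, my = mean(x), mean(y)
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    std_x = sqrt(sum(a * a for a in dx) / n)
    std_y = sqrt(sum(b * b for b in dy) / n)
    if std_x == 0 or std_y == 0:
        # A constant series has zero spread, so the ratio is undefined.
        raise ValueError("correlation is undefined for a constant series")
    return cov / (std_x * std_y)


# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]
r = correlation(hours, score)
print(round(r, 3))
```

The guard for a constant series is a design choice of this sketch: with zero standard deviation the division has no meaning, so raising an error seemed safer than returning a number.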

Properties of correlation

  1. It is symmetric, which means that the correlation between x and y is the same as the correlation between y and x
  2. It is always between -1 and 1
  3. The correlation coefficient is independent of change of scale and change of origin, which means that it does not change when you multiply or divide each element of a series by a positive number, or add a number to or subtract a number from each element. Multiplying one series by a negative number reverses the sign of the correlation but keeps its magnitude
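These properties can be checked numerically. The sketch below assumes a small helper `correlation` written for this example and uses arbitrary numbers; the negative scale factor at the end is included as an edge case of the scale-change property.

```python
from math import isclose
from statistics import mean, pstdev


def correlation(x, y):
    """Covariance divided by the product of population standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


# Arbitrary sample data for the check.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 6.0, 5.0, 9.0]
r = correlation(x, y)

# 1. Symmetry: swapping x and y gives the same value.
symmetric = isclose(r, correlation(y, x))

# 2. Range: the value lies between -1 and 1.
in_range = -1.0 <= r <= 1.0

# 3. Change of scale (positive factor) and origin leaves r unchanged.
x_new = [3 * a + 10 for a in x]
y_new = [0.5 * b - 2 for b in y]
invariant = isclose(r, correlation(x_new, y_new))

# Edge case: a negative factor flips the sign but keeps the magnitude.
sign_flips = isclose(correlation([-a for a in x], y), -r)

print(symmetric, in_range, invariant, sign_flips)
```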

Inference of correlation

  1. If the correlation coefficient is close to -1, the value of y tends to decrease as the value of x increases. This implies strong negative correlation
  2. If the correlation coefficient is close to 0, there is little or no linear relation between x and y. This implies no correlation, although it does not guarantee that x and y are independent
  3. If the correlation coefficient is close to 1, the value of y tends to increase as the value of x increases. This implies strong positive correlation
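The three cases can be illustrated with a short sketch. The helper names `correlation` and `describe`, the cut-off values 0.8 and 0.2, and the data are all assumptions made for this example; the article does not define numeric thresholds for "close to".

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Covariance divided by the product of population standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def describe(r, strong=0.8, weak=0.2):
    """Label a coefficient; the thresholds are arbitrary, illustrative choices."""
    if r >= strong:
        return "strong positive"
    if r <= -strong:
        return "strong negative"
    if abs(r) <= weak:
        return "little or no linear relation"
    return "moderate"


x = [-2, -1, 0, 1, 2]
y_pos = [2 * a + 1 for a in x]  # rises with x
y_neg = [-3 * a for a in x]     # falls as x rises
y_none = [1, -2, 2, -2, 1]      # no linear trend

for y in (y_pos, y_neg, y_none):
    print(describe(correlation(x, y)))
```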

Demerits of correlation

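Correlation can only measure the linear relation between two variables. It cannot capture a nonlinear relation such as a polynomial one, so two variables that are strongly related in a nonlinear way can still have a correlation close to 0.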
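A quick way to see this limitation is to compute the correlation for a relation that is exact but not linear. The sketch below uses y = x squared on values of x that are symmetric around 0; the helper `correlation` and the data are assumptions made for this example.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Covariance divided by the product of population standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * a for a in x]     # exact linear relation
y_quadratic = [a * a for a in x]  # exact polynomial relation

r_linear = correlation(x, y_linear)
r_quadratic = correlation(x, y_quadratic)

# y_quadratic is fully determined by x, yet the coefficient is about 0,
# because the positive and negative products cancel out.
print(r_linear, r_quadratic)
```

Here y is completely determined by x in both cases, but only the linear relation is reflected in the coefficient.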

Aayushmaan Jain

A data science enthusiast currently pursuing a bachelor's degree in data science