Correlation is a measure of the strength of the relation between two variables. The correlation coefficient is used in various statistical analyses and machine learning algorithms.

Need for correlation

Summary statistics such as the mean, median, mode and standard deviation cannot be used to compare two bivariate datasets. Each of them describes one variable at a time, so two datasets can share the same summary statistics and still have very different relations between x and y. In addition, calculating the median or mode of each column separately requires sorting that column, which breaks the pairing between the columns: after sorting, a value of x may be matched with a different value of y than in the original data.
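The point above can be illustrated with a small sketch. The numbers are made up for illustration, and the helper `summary` is a hypothetical name, not something from the original text. Both y series contain the same values, so their summary statistics match, even though one rises with x and the other falls.

```python
from statistics import mean, median, pstdev

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # y rises as x rises
y_down = [10, 8, 6, 4, 2]   # same values, paired with x in the opposite order


def summary(values):
    """Return (mean, median, population standard deviation) of one column."""
    return (mean(values), median(values), pstdev(values))


# The per-column summaries cannot tell the two datasets apart.
same_summary = summary(y_up) == summary(y_down)

# Sorting each column independently destroys the original (x, y) pairing.
original_pairs = list(zip(x, y_down))
sorted_pairs = list(zip(sorted(x), sorted(y_down)))

print(same_summary)
print(original_pairs)
print(sorted_pairs)
```

After sorting, the pair (1, 10) from the original data no longer exists, which is why a measure that keeps each x matched with its own y is needed.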

Intuition behind correlation

Let there be two data series named x and y

To analyze the correlation, we can plot them as a scatter plot, with x on the horizontal axis and y on the vertical axis

Then we draw a horizontal line at y=mean(y) and a vertical line at x=mean(x)

Now we can shift the origin to (mean(x), mean(y)), so that each point (x, y) becomes (x-mean(x), y-mean(y)). The two lines drawn in the previous step become the new axes and divide the plot into four quadrants. For each point we take the product of its abscissa and ordinate. This product is positive in quadrants 1 and 3 and negative in quadrants 2 and 4. In other words, the product is positive when x and y lie on the same side of their means (y tends to increase with x) and negative when they lie on opposite sides (y tends to decrease as x increases). We then sum this product over all the points and divide the sum by the number of points, which removes the effect of the size of the data series. The resulting quantity is known as covariance and can be given by the formula

$$\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(x_i - \operatorname{mean}(x)\bigr)\bigl(y_i - \operatorname{mean}(y)\bigr)$$

where n is the number of points in the data series.
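The steps above can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `covariance` and the sample data are assumptions made for illustration. It divides by the number of points, as described in the text (the population covariance). Some libraries divide by n-1 by default instead.

```python
def covariance(x, y):
    """Population covariance: mean of the products of deviations from the means."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Shift the origin to (mean_x, mean_y), multiply the new coordinates
    # of each point, sum the products, and divide by the number of points.
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # points fall in quadrants 1 and 3
y_down = [10, 8, 6, 4, 2]   # points fall in quadrants 2 and 4

print(covariance(x, y_up))    # 4.0
print(covariance(x, y_down))  # -4.0
```

The sign of the result shows the direction of the relation, but its size still depends on the units of x and y.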

Now we divide the covariance by the product of the standard deviations of x and y. We do so because:

  1. It cancels the units of each column in the bivariate data series, which makes correlation independent of units, so that any two series can be compared

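Hence the formula for correlation can be given by

$$r = \frac{\operatorname{cov}(x, y)}{\operatorname{std}(x)\,\operatorname{std}(y)}$$

where std denotes the standard deviation.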
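As a rough sketch of this calculation, the function below divides the covariance by the product of the standard deviations. The name `correlation` and the height and weight figures are invented for illustration. The example checks that changing the units of one column (metres to centimetres) leaves the result unchanged.

```python
import math


def correlation(x, y):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (std_x * std_y)


# Illustrative data only: heights in metres and weights in kilograms.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 58, 69, 74, 85]
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(r_m, r_cm)
print(math.isclose(r_m, r_cm))
```

The covariance of the two versions differs by a factor of 100, but the standard deviation of the heights changes by the same factor, so the ratio stays the same up to floating-point rounding.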

Properties of correlation

  1. It is symmetric, which means that the correlation between x and y is the same as the correlation between y and x

Inference of correlation

  1. If the correlation coefficient is close to -1, the value of x decreases as the value of y increases. This implies a strong negative correlation

Demerits of correlation

Correlation can only capture a linear relation between two variables. It cannot capture a non-linear relation, such as a polynomial one, so two variables can be strongly related and still have a correlation coefficient close to 0.
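A small sketch makes this limitation concrete. The data and the function name `correlation` are assumptions chosen for illustration. Here y is completely determined by x through y = x^2, yet the correlation coefficient comes out as essentially zero because the positive and negative products cancel.

```python
import math


def correlation(x, y):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (std_x * std_y)


x = [-3, -2, -1, 0, 1, 2, 3]
y_square = [xi ** 2 for xi in x]   # exact quadratic relation, symmetric about x = 0
y_line = [2 * xi + 1 for xi in x]  # exact linear relation, for comparison

r_square = correlation(x, y_square)
r_line = correlation(x, y_line)

print(r_square)  # essentially zero
print(r_line)    # essentially one
```

A coefficient near 0 therefore means "no linear relation", not "no relation", and plotting the data remains a useful check.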