Correlation
Correlation is a measure of the strength of the linear relationship between two variables. The correlation coefficient is used in many statistical analyses and machine learning algorithms.
Need for correlation
Single-variable summaries such as the mean, median, mode and standard deviation cannot be used to compare two bivariate datasets. Two datasets can share the same summary values for x and for y and still pair those values very differently. Measures such as the median are also computed by sorting each variable on its own, which breaks the pairing between x and y: after sorting, a given value of x may no longer sit beside the value of y it was observed with.
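As a small illustration of this point, the sketch below builds two bivariate datasets from the same numbers. The data is hypothetical and chosen only for the demonstration, and the names y_up and y_down are made up here.

```python
from statistics import mean, pstdev

# Hypothetical toy data: the same x and y values, paired in two different ways.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # y rises as x rises
y_down = [10, 8, 6, 4, 2]  # same y values, opposite pairing with x

# Per-variable summaries cannot tell the two datasets apart.
same_mean = mean(y_up) == mean(y_down)
same_spread = pstdev(y_up) == pstdev(y_down)
print(same_mean, same_spread)

# Sorting y on its own discards the original pairing with x:
# the sorted version of y_down is indistinguishable from y_up.
pairing_lost = sorted(y_down) == y_up
print(pairing_lost)
```

Both datasets have identical means and standard deviations for y, yet the relationship between x and y is opposite in the two cases. This is the gap that a measure of joint variation has to fill.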
Intuition behind correlation
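Let there be two data series named x and y, where each value of x is paired with a value of y.
To analyze the correlation, we can plot each pair as a point (x, y) on a scatter plot.
Then we draw a horizontal line at y = mean(y) and a vertical line at x = mean(x). These two lines divide the plot into four quadrants.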
Now we shift the origin to the point (mean(x), mean(y)), so that each point (x, y) becomes (x - mean(x), y - mean(y)). For each point we take the product of its new abscissa and ordinate. This product is positive for points in quadrants 1 and 3 and negative for points in quadrants 2 and 4. So when x and y tend to increase together, most of the products are positive, and when one tends to decrease as the other increases, most of the products are negative. We then sum the products over all the points and divide the sum by the number of points n, which removes the effect of the size of the data series. The resulting quantity is known as covariance and is given by the formula
cov(x, y) = (1/n) * Σ (x_i - mean(x)) * (y_i - mean(y))
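The steps described above (shift the origin to the means, multiply the shifted coordinates, then average) can be sketched in Python as follows. The function name covariance and the sample data are assumptions made for illustration, and the sketch divides by n as described in the text rather than by n - 1.

```python
def covariance(x, y):
    """Population covariance: the average product of deviations from the means."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Product of the shifted abscissa and ordinate for every point, then averaged.
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(covariance(x, y))        # 4.0, x and y increase together
print(covariance(x, y[::-1]))  # -4.0, y decreases as x increases
```

The sign of the result shows the direction of the relationship, but its size depends on the units of x and y, so it cannot be compared across datasets.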
Now we divide the covariance by the product of the standard deviations of x and y. We do so because:
- It cancels the units of each variable in the bivariate data series, which makes correlation independent of units so that any two series can be compared
- It brings the correlation into the range -1 to 1, which provides a universal scale for comparing bivariate data series
- Since the standard deviation is never negative (and is positive unless the series is constant), dividing by it does not change the sign of the covariance
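Hence the formula of correlation is given by
corr(x, y) = cov(x, y) / (std(x) * std(y))
where std(x) and std(y) are the standard deviations of x and y, computed with the same divisor n as the covariance.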
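A minimal Python sketch of this calculation is shown below. The function name correlation and the sample data are assumptions for illustration. The sketch uses population (divide by n) covariance and standard deviations, and it treats a constant series as an error because the standard deviation would be zero.

```python
import math


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (std_x * std_y)


# Hypothetical sample data: y is an exact increasing linear function of x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(correlation(x, y))        # approximately 1
print(correlation(x, y[::-1]))  # approximately -1
```

Because of floating-point rounding, the printed values may differ from 1 and -1 in the last few decimal places.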
Properties of correlation
- It is symmetric, which means that the correlation between x and y is the same as the correlation between y and x
- It always lies between -1 and 1
- The correlation coefficient is independent of change of origin and change of scale. It does not change when you add a constant to, or subtract a constant from, every element of a series, or when you multiply or divide every element by a positive constant. Multiplying or dividing by a negative constant only flips its sign
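These properties can be checked numerically. The sketch below is one way to do so, using a hypothetical helper named pearson and made-up data. It compares values with math.isclose because floating-point results are rarely exactly equal.

```python
import math


def pearson(x, y):
    """Correlation coefficient using population (divide by n) statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (std_x * std_y)


# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson(x, y)

# Symmetry: swapping the two series gives the same coefficient.
symmetric = math.isclose(r, pearson(y, x))

# Range: the coefficient lies between -1 and 1.
in_range = -1.0 <= r <= 1.0

# Change of origin: adding a constant to every element changes nothing.
shifted = math.isclose(r, pearson([xi + 100 for xi in x], y))

# Change of scale: multiplying by a positive constant changes nothing.
scaled = math.isclose(r, pearson([xi * 2.5 for xi in x], y))

# A negative multiplier keeps the magnitude but flips the sign.
flipped = math.isclose(-r, pearson([xi * -2.5 for xi in x], y))

print(symmetric, in_range, shifted, scaled, flipped)
```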
Inference of correlation
- If the correlation coefficient is close to -1, the value of x tends to decrease as the value of y increases. This implies a strong negative correlation
- If the correlation coefficient is close to 0, there is little or no linear relationship between x and y. This implies no correlation, although x and y may still be related in a nonlinear way
- If the correlation coefficient is close to 1, the value of x tends to increase as the value of y increases. This implies a strong positive correlation
Demerits of correlation
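Correlation measures only the linear relationship between two variables. It cannot detect a nonlinear relationship, such as a polynomial one. For example, if y = x^2 and the values of x are spread symmetrically around 0, the correlation coefficient is 0 even though y is completely determined by x.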
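This limitation can be seen with a short Python sketch. The helper pearson and the data are assumptions chosen for illustration: y is set to the square of x, with x placed symmetrically around 0.

```python
import math


def pearson(x, y):
    """Correlation coefficient using population (divide by n) statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (std_x * std_y)


# Hypothetical data: y depends on x exactly, but the dependence is quadratic.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_quadratic = pearson(x, y)
print(r_quadratic)  # close to 0 despite the exact relationship

# For comparison, a linear function of the same x gives a coefficient near 1.
r_linear = pearson(x, [2 * xi + 1 for xi in x])
print(r_linear)
```

The positive products from the right half of the parabola cancel the negative products from the left half, so the covariance sums to zero.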