Univariate Linear Regression
Regression
In regression we fit a line through the data points to predict a continuous target variable from one or more independent explanatory variables. In univariate linear regression there is a single explanatory variable x and a single target y.
Let the line be y = α + βx
We use the least squares approach. We square the error because:
- It makes all errors positive, so positive and negative errors cannot cancel each other out (see the short example after this list)
- It diminishes errors that are smaller than one and magnifies errors that are greater than one
- The squared projections of an error along the x and y axes add up to the square of the error itself, for example (r sinθ)² + (r cosθ)² = r²
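As a quick illustration of the first point, two residuals of +2 and −2 cancel when summed directly but not when squared:

```python
# Two residuals of equal magnitude but opposite sign.
errors = [2, -2]

print(sum(errors))                   # 0 -> the raw sum hides the misfit entirely
print(sum(e ** 2 for e in errors))   # 8 -> squaring keeps every contribution positive
```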
Error = y − α − βx
Squared error = (y − α − βx)²
Sum of squared errors = ∑(y − α − βx)²
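For concreteness, the sum of squared errors for a candidate pair (α, β) can be computed directly from the data; the values below are arbitrary toy numbers used only for illustration.

```python
# Toy data and candidate coefficients (arbitrary values, for illustration only).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
alpha, beta = 0.2, 1.95

# Sum of squared errors: sum of (y - alpha - beta*x)^2 over all points.
sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
print(sse)
```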
Now we need to minimize the sum of squared errors with respect to α and β.
Minimizing the error with respect to α gives:
∑y = nα + β∑x … (1)
Minimizing the error with respect to β gives:
∑xy = α∑x + β∑x² … (2)
We can solve these two equations simultaneously to get the values of α and β, or we can eliminate one variable.
Multiplying (1) by ∑x, multiplying (2) by n, and subtracting one from the other gives:
n∑xy − ∑x∑y = β(n∑x² − (∑x)²)
Now, dividing both sides by n² we get:
(1/n)∑xy − x̄ȳ = β((1/n)∑x² − x̄²)
which can be simplified as:
β = ((1/n)∑xy − x̄ȳ) / ((1/n)∑x² − x̄²) = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²
Hence:
β = Cov(x, y) / Var(x)
For finding α, substitute β back into (1) and divide by n:
α = ȳ − βx̄
Python code for implementing Univariate Linear Regression
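The listing below is a minimal from-scratch sketch of such a regressor. The class name Regressor matches the one referred to later in this article, but the method names (fit, predict) and the toy data are assumptions.

```python
import numpy as np


class Regressor:
    """Univariate linear regression fit by least squares (y = alpha + beta*x)."""

    def fit(self, x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        x_mean, y_mean = x.mean(), y.mean()
        # beta = Cov(x, y) / Var(x); alpha = y_mean - beta * x_mean
        self.beta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
        self.alpha = y_mean - self.beta * x_mean
        return self

    def predict(self, x):
        # Predictions from the fitted line y = alpha + beta*x
        return self.alpha + self.beta * np.asarray(x, dtype=float)


# Example usage with toy data.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
model = Regressor().fit(x, y)
print(model.alpha, model.beta)
print(model.predict([6, 7]))
```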
Regression using scikit-learn
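A rough equivalent using scikit-learn's LinearRegression, assuming the same toy data as above; note that the feature array must be two-dimensional.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data; X must have shape (n_samples, n_features) for scikit-learn.
X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

model = LinearRegression().fit(X, y)
print(model.intercept_)   # alpha
print(model.coef_[0])     # beta
print(model.predict(np.array([[6.0], [7.0]])))
```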
Regression using Statsmodels
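A similar sketch with statsmodels, again using the same toy data; OLS does not add an intercept by itself, so add_constant is used to include α.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# add_constant prepends a column of ones so that the intercept alpha is estimated.
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

print(results.params)     # [alpha, beta]
print(results.summary())  # full regression report
```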
Standard error
The standard error of the regression (S), also known as the standard error of the estimate, represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average, in the units of the response variable. Smaller values are better because they indicate that the observations fall closer to the fitted line.
Here is the formula for the standard error:
S = √( ∑(y − ŷ)² / (n − 2) )
Here ŷ denotes the prediction made using the fitted line y = α + βx, and n − 2 accounts for the two estimated parameters α and β.
Here is the code to find the standard error using the Regressor class that we have created
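A minimal sketch of what that code could look like, assuming the Regressor class sketched earlier (with fit and predict methods) and the same toy data:

```python
import numpy as np

# Assumes the Regressor sketch defined above.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
model = Regressor().fit(x, y)

y_hat = model.predict(x)
n = len(y)
# S = sqrt( sum((y - y_hat)^2) / (n - 2) ); n - 2 because alpha and beta are estimated.
standard_error = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
print(standard_error)
```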
R-squared (R²)
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
It is the percentage of the response variable variation that is explained by a linear model.
Or:
R-squared = Explained variation / Total variation
The formula for R² is:
R² = 1 − ∑(y − ŷ)² / ∑(y − ȳ)²
Where:
ŷ are our predictions made using the regression model, and ȳ is the mean of the observed values of y
Here is the Python code to find R²:
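A minimal sketch, again assuming the Regressor class sketched earlier and the same toy data:

```python
import numpy as np

# Assumes the Regressor sketch defined above.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
y_hat = Regressor().fit(x, y).predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

If scikit-learn is available, sklearn.metrics.r2_score(y, y_hat) should give the same value.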