The technique of Gradient Descent
Gradient descent is an algorithm that minimizes the error or loss metric in order to obtain the best possible set of parameters for your model. This technique is very flexible and has several hyperparameters, such as the learning rate and the number of epochs, which can be tuned for better optimization
Gradient descent can be applied to various machine learning models and neural networks to find the optimal set of parameters without using advanced mathematics
Terms related to gradient descent:
- Learning rate — The size of each step taken down the cost curve. A higher learning rate implies a faster descent, but it can overshoot and miss the minimum during gradient descent
- Epoch — The number of complete passes over the training data during which the parameters are updated in order to reach their best possible values
Visualizing gradient descent
Process of Gradient Descent
While performing gradient descent, we first compute the hypothesis function for the initialized theta values. We then calculate the cost of the model, and finally we repeatedly update theta using a specified learning rate until we obtain the theta values that minimize the cost
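The update step described here can be written in a standard form (where alpha is the learning rate and J(theta) is the cost function):

```latex
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta}
```

Each term of this rule is explained below.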
Significance of each term in gradient descent
- Theta new — The updated value of theta obtained after an iteration of gradient descent, which yields more accurate predictions
- Theta old — The current value of theta, which is to be updated
- Negative sign — This always guides the algorithm in the right direction. If the derivative of the cost function is positive, the negative sign moves theta to the left, towards decreasing cost; if the derivative is negative, it moves theta to the right, which again points in the direction of decreasing cost
- Learning rate — This term determines the size of the steps taken in the direction of the minima to find the parameter with the least cost
- Derivative — This term gives the slope of the cost function at the current theta, and its magnitude also scales the size of the steps taken towards the minimum
Implementing Gradient Descent in various machine learning models
- Linear Regression
In this model, we aim to find the parameters that fit the linear model in an optimized way, i.e. the parameters that minimize the cost
Cost function
The hypothesis function of Linear Regression can also be expressed as the dot product of the initialized parameters (represented by theta) and the inputs
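In standard notation, this hypothesis can be sketched as (x is augmented with a leading 1 so that the intercept folds into the dot product):

```latex
h_\theta(x) = \theta_0 + \theta_1 x = \theta^{T} x
```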
Where
- Theta 0 is the intercept
- Theta1 is the parameter/coefficients of regression
- X is the input
Now, let's look at the cost function of Linear Regression
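One common form of this cost function is the mean squared error (the extra factor of 1/2 is a convention that cancels the exponent when differentiating):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```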
Where
- m is the number of training examples
- h(theta) is the hypothesis function
- y is the target variable
Significance of each term in the cost function
Dividing by m — Since we do not want the number of training examples to affect the magnitude of the cost, we divide by m to make the cost independent of the size of the training set
Sigma — Since we want to calculate the cost over all training examples, we take the sum of squared errors
Squared error:
- Positive and negative errors might cancel out each other, hence taking the square of the errors prevents the errors from being cancelled
- The errors with magnitude less than 1 are shrunk and the errors with magnitude greater than 1 are amplified, imposing a larger penalty on big errors and hence helping our model learn in a better way
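A quick numeric illustration of both points above (the error values are made up):

```python
# Two errors of equal magnitude but opposite sign
errors = [2.0, -2.0]

# Without squaring, the errors cancel and the total looks perfect
raw_sum = sum(errors)  # misleadingly zero

# Squaring keeps every error positive, so nothing cancels
squared_sum = sum(e ** 2 for e in errors)

# Errors below 1 in magnitude shrink, errors above 1 grow
small, large = 0.5, 3.0
small_sq, large_sq = small ** 2, large ** 2  # 0.25 and 9.0
```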
Calculating the differentiation of the cost function
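Differentiating the cost function above with respect to each parameter gives (the 1/2 factor cancels the exponent, leaving):

```latex
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```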
Using Linear Algebra in Gradient Descent
We can concatenate a column of ones to the inputs in case we want to find the intercept
For finding the hypothesis function, we can take the matrix product of the inputs with the initialized parameters
We can calculate the error by subtracting y from the hypothesis function
We can calculate the cost by using the formula
Finally, we can calculate differentiation by using the formula
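The linear-algebra steps above can be sketched in NumPy as follows (the data and variable names `X`, `y`, `theta` are illustrative, not taken from the original):

```python
import numpy as np

# Toy data: y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = X.shape[0]

# Concatenate a column of ones to the inputs to model the intercept
X_b = np.hstack([np.ones((m, 1)), X])

# Initialize the parameters (theta_0 = intercept, theta_1 = slope)
theta = np.zeros(X_b.shape[1])

# Hypothesis: matrix product of the inputs with the parameters
h = X_b @ theta

# Error: hypothesis minus the target
error = h - y

# Cost: sum of squared errors, scaled by 1/(2m)
cost = (error ** 2).sum() / (2 * m)

# Differentiation (gradient): X^T (h - y) / m
gradient = X_b.T @ error / m
```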
Python implementation of Gradient Descent using Linear Regression
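A minimal end-to-end sketch of such an implementation, putting the pieces above in a loop (the learning rate, epoch count, and toy data are illustrative choices, not from the original):

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, epochs=1000):
    """Fit linear-regression parameters by batch gradient descent.

    X : (m, n) input matrix, y : (m,) target vector.
    Returns the learned theta (intercept first) and the cost per epoch.
    """
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])  # prepend ones for the intercept
    theta = np.zeros(X_b.shape[1])
    costs = []
    for _ in range(epochs):
        h = X_b @ theta                                # hypothesis
        error = h - y                                  # error term
        costs.append((error ** 2).sum() / (2 * m))     # current cost
        theta -= learning_rate * (X_b.T @ error) / m   # update rule
    return theta, costs

# Example usage: recover y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1
theta, costs = gradient_descent(X, y)
```

With these settings, theta converges to approximately [1, 2] (intercept and slope), and the recorded costs decrease over the epochs.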
Written by:
Aayushmaan Jain