The technique of Gradient Descent
Gradient descent is an algorithm that minimizes the error or loss metric in order to obtain the best possible set of parameters for your model. This technique is very flexible and has several hyperparameters, such as the learning rate and the number of epochs, which can be tuned for better optimization
Gradient descent can be applied to various machine learning models and neural networks to find the optimal set of parameters without using advanced mathematics
Terms related to gradient descent:
- Learning rate — The size of each step taken down the cost curve. A higher learning rate implies a faster descent, but it can overshoot and miss the minimum during gradient descent
- Epoch — The number of complete passes over the training data during which the parameters are updated in order to reach their best possible values
Visualizing gradient descent
Process of Gradient Descent
While performing gradient descent, we first compute the hypothesis function for the initialized theta values. We then calculate the cost of the model, and finally we repeatedly update theta using a specified learning rate until we obtain the theta values that minimize the cost
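The update step described here can be written in a standard form (where alpha is the learning rate and J(theta) is the cost function):

```latex
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta}
```

Each term of this rule is explained below.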
Significance of each term in gradient descent
- Theta new — The updated value of theta obtained after an iteration of gradient descent, which yields more accurate predictions
- Theta old — The current value of theta, which is to be updated
- Negative sign — This always guides the algorithm in the right direction. If the derivative of the cost function is positive, the negative sign moves theta to the left, towards decreasing cost; if the derivative is negative, it moves theta to the right, which again points in the direction of decreasing cost
- Learning rate — This term determines the size of the steps taken in the direction of the minima to find the parameter with the least cost
- Derivative — This term gives the slope of the cost function at the current theta, and its magnitude also scales the size of the steps taken towards the minimum
Implementing Gradient Descent in various machine learning models
- Linear Regression
In this model, we aim to find the parameters that fit the linear model in an optimized way, i.e. the parameters that minimize the cost
Cost function
The hypothesis function of Linear Regression can also be expressed as the dot product of the initialized parameters (represented by theta) and the inputs
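In standard notation, this hypothesis can be sketched as (x is augmented with a leading 1 so that the intercept folds into the dot product):

```latex
h_\theta(x) = \theta_0 + \theta_1 x = \theta^{T} x
```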
Where
- Theta 0 is the intercept
- Theta1 is the parameter/coefficients of regression
- X is the input
Now, let's look at the cost function of Linear Regression
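One common form of this cost function is the mean squared error (the extra factor of 1/2 is a convention that cancels the exponent when differentiating):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```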
Where
- m is the number of training examples
- h(theta) is the hypothesis function
- y is the target variable
Significance of each term in the cost function
Dividing by m — Since we do not want the number of training examples to affect the magnitude of the cost, we divide by m to make the cost independent of the size of the training set
Sigma — Since we want to calculate the cost over all training examples, we take the sum of squared errors
Squared error:
- Positive and negative errors might cancel out each other, hence taking the square of the errors prevents the errors from being cancelled
- The errors with magnitude less than 1 are shrunk and the errors with magnitude greater than 1 are amplified, imposing a larger penalty on big errors and hence helping our model learn in a better way
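A quick numeric illustration of both points above (the error values are made up):

```python
# Two errors of equal magnitude but opposite sign
errors = [2.0, -2.0]

# Without squaring, the errors cancel and the total looks perfect
raw_sum = sum(errors)  # misleadingly zero

# Squaring keeps every error positive, so nothing cancels
squared_sum = sum(e ** 2 for e in errors)

# Errors below 1 in magnitude shrink, errors above 1 grow
small, large = 0.5, 3.0
small_sq, large_sq = small ** 2, large ** 2  # 0.25 and 9.0
```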
Calculating the differentiation of the cost function
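Differentiating the cost function above with respect to each parameter gives (the 1/2 factor cancels the exponent, leaving):

```latex
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```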
Using Linear Algebra in Gradient Descent
We can concatenate a column of ones to the inputs in case we want to find the intercept
For finding the hypothesis function, we can take the matrix product of the inputs with the initialized parameters
We can calculate the error by subtracting y from the hypothesis function
We can calculate the cost by using the formula
Finally, we can calculate differentiation by using the formula
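The linear-algebra steps above can be sketched in NumPy as follows (the data and variable names `X`, `y`, `theta` are illustrative, not taken from the original):

```python
import numpy as np

# Toy data: y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = X.shape[0]

# Concatenate a column of ones to the inputs to model the intercept
X_b = np.hstack([np.ones((m, 1)), X])

# Initialize the parameters (theta_0 = intercept, theta_1 = slope)
theta = np.zeros(X_b.shape[1])

# Hypothesis: matrix product of the inputs with the parameters
h = X_b @ theta

# Error: hypothesis minus the target
error = h - y

# Cost: sum of squared errors, scaled by 1/(2m)
cost = (error ** 2).sum() / (2 * m)

# Differentiation (gradient): X^T (h - y) / m
gradient = X_b.T @ error / m
```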
Python implementation of Gradient Descent using Linear Regression
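A minimal end-to-end sketch of such an implementation, putting the pieces above in a loop (the learning rate, epoch count, and toy data are illustrative choices, not from the original):

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, epochs=1000):
    """Fit linear-regression parameters by batch gradient descent.

    X : (m, n) input matrix, y : (m,) target vector.
    Returns the learned theta (intercept first) and the cost per epoch.
    """
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])  # prepend ones for the intercept
    theta = np.zeros(X_b.shape[1])
    costs = []
    for _ in range(epochs):
        h = X_b @ theta                                # hypothesis
        error = h - y                                  # error term
        costs.append((error ** 2).sum() / (2 * m))     # current cost
        theta -= learning_rate * (X_b.T @ error) / m   # update rule
    return theta, costs

# Example usage: recover y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1
theta, costs = gradient_descent(X, y)
```

With these settings, theta converges to approximately [1, 2] (intercept and slope), and the recorded costs decrease over the epochs.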
Written by:
Aayushmaan Jain