# The technique of Gradient Descent

Gradient descent is an optimization algorithm that iteratively adjusts a model's parameters to minimize an error (loss) metric. The technique is flexible and has several hyperparameters, such as the learning rate and the number of epochs, which can be tuned for better optimization.

Gradient descent can be applied to many machine learning models, including neural networks, to find a good set of parameters using only the derivative of the cost function, without having to solve for the optimum analytically.

Terms related to gradient descent:

- Learning rate: The size of each step taken down the cost curve. A higher learning rate gives faster descent, but it can overshoot and miss the minimum
- Epoch: One round of updating the parameters (not the hyperparameters). The number of epochs is how many times the parameters are revised on the way to the best possible values
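The effect of these two terms can be seen in a minimal sketch. It assumes a toy cost J(theta) = theta**2 with its minimum at 0; the function name `descend` and the chosen values are illustrative only.

```python
def descend(learning_rate, epochs, theta=1.0):
    """Run gradient descent on J(theta) = theta**2 and return the final theta."""
    for _ in range(epochs):
        gradient = 2 * theta                      # dJ/dtheta for J(theta) = theta**2
        theta = theta - learning_rate * gradient  # one parameter update per epoch
    return theta

small = descend(learning_rate=0.1, epochs=50)  # steps shrink towards the minimum at 0
large = descend(learning_rate=1.1, epochs=50)  # every step overshoots, so theta grows

print(abs(small), abs(large))
```

With the small learning rate, theta ends close to 0, while the large one moves further from the minimum with every epoch.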

## Visualizing gradient descent

## Process of Gradient Descent

When performing gradient descent, we first evaluate the hypothesis function for the initial value of theta. We then calculate the cost of those predictions and update theta in the direction that reduces the cost, with a step scaled by the specified learning rate. Repeating this for the chosen number of epochs gives the best theta value found for the parameter.
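Each iteration applies the standard gradient descent update rule. The symbols are assumptions made here for illustration: $\alpha$ is the learning rate and $J(\theta)$ is the cost.

$$
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta}
$$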

### Significance of each term in gradient descent

- Theta new: The value of theta obtained after one iteration of gradient descent, which should give a lower cost and therefore more accurate predictions
- Theta old: The current value of theta, which is to be updated
- Negative sign: This guides the algorithm towards decreasing cost. If the derivative of the cost function is positive, theta moves to the left (it decreases); if the derivative is negative, theta moves to the right (it increases). In both cases the move is in the direction of decreasing cost
- Learning rate: This term scales the size of the steps taken towards the minimum
- Derivative: The slope of the cost function at the current theta. Its sign sets the direction of the step and its magnitude also scales the step, so the steps shrink as the slope flattens near the minimum
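The role of the negative sign can be checked with a short sketch. It assumes a toy cost J(theta) = (theta - 3)**2 with its minimum at theta = 3; the names `cost_derivative` and `step` are hypothetical.

```python
def cost_derivative(theta):
    # Derivative of the assumed cost J(theta) = (theta - 3)**2
    return 2 * (theta - 3)

def step(theta_old, learning_rate=0.1):
    # theta_new = theta_old - learning_rate * derivative
    return theta_old - learning_rate * cost_derivative(theta_old)

right_of_min = step(5.0)  # derivative is positive here, so theta moves left
left_of_min = step(1.0)   # derivative is negative here, so theta moves right

print(right_of_min, left_of_min)
```

Starting on either side of the minimum, a single update moves theta towards 3.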

## Implementing Gradient Descent in various machine learning models

- Linear Regression

In this model, we aim to find the parameters of the line that best fits the data, that is, the parameters that minimize the cost

### Cost function

The hypothesis function of Linear Regression can also be expressed as the dot product of the initialized parameters (represented by theta) and the inputs
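For a single input, and assuming the usual notation, the hypothesis can be written as:

$$
h_\theta(X) = \theta_0 + \theta_1 X = \begin{bmatrix} \theta_0 & \theta_1 \end{bmatrix} \cdot \begin{bmatrix} 1 & X \end{bmatrix}
$$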

Where

- Theta 0 is the intercept
- Theta 1 is the regression coefficient (the slope)
- X is the input

Now, let's look at the cost function of Linear Regression
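The form below assumes the widely used squared-error cost scaled by $\frac{1}{2m}$. The factor $\frac{1}{2}$ is only a convention that simplifies the derivative, and some texts use $\frac{1}{m}$ instead.

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(X^{(i)}) - y^{(i)} \right)^2
$$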

Where

- m is the number of training examples
- h(theta) is the hypothesis function
- y is the target variable

### Significance of each term in the cost function

Dividing by m: This averages the squared errors, so the cost does not grow simply because the training set has more examples, and it stays comparable across training sets of different sizes

Sigma: Since we want the cost over all training examples, we sum the squared errors of every example

Squared error:

- Positive and negative errors could cancel each other out when summed; squaring makes every error non-negative, which prevents this
- Squaring shrinks errors smaller than 1 in magnitude and amplifies errors larger than 1 in magnitude, so large errors are penalized more heavily and the model is pushed to correct them
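Both points about squaring can be seen on a few made-up numbers. The error values below are hypothetical, and the cost uses an assumed 1/(2m) scaling.

```python
errors = [2.0, -2.0, 0.5]  # hypothetical prediction errors (hypothesis minus target)

raw_sum = sum(errors)                    # positive and negative errors cancel out
squared = [e ** 2 for e in errors]       # every squared error is non-negative
cost = sum(squared) / (2 * len(errors))  # assumed 1/(2m) scaling

print(raw_sum, squared, cost)
```

The raw sum hides the two large errors, while the squared errors keep them and shrink the error that was smaller than 1.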

### Calculating the derivative of the cost function
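Assuming the cost is $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(X^{(i)}) - y^{(i)} \right)^2$ with a linear hypothesis, the chain rule gives the derivative with respect to each parameter $\theta_j$. Here $X_j^{(i)}$ is the $j$-th input of example $i$, with $X_0^{(i)} = 1$ for the intercept.

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(X^{(i)}) - y^{(i)} \right) X_j^{(i)}
$$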

## Using Linear Algebra in Gradient Descent

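We can concatenate a column of ones to the inputs so that the intercept is learned along with the other parameters, giving the input matrix $X$

To compute the hypothesis function, we take the product of the inputs and the initialized parameters: $h = X\theta$

We can calculate the error by subtracting y from the hypothesis function: $e = X\theta - y$

We can calculate the cost by using the formula $J(\theta) = \frac{1}{2m} e^T e$, assuming the common $\frac{1}{2m}$ scaling of the squared error

Finally, we can calculate the derivative by using the formula $\frac{\partial J}{\partial \theta} = \frac{1}{m} X^T e$, under the same assumption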
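The steps above can be sketched with NumPy. The toy data and variable names are assumptions, and the cost uses the 1/(2m) scaling, which may differ from the author's exact formula.

```python
import numpy as np

# Hypothetical toy data generated from y = 1 + 2x
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
m = len(y)

# Concatenate a column of ones so that theta[0] acts as the intercept
X = np.hstack([np.ones((m, 1)), x])

theta = np.zeros(X.shape[1])      # initialized parameters
hypothesis = X @ theta            # product of inputs and parameters
error = hypothesis - y            # hypothesis minus target
cost = (error @ error) / (2 * m)  # assumed cost: sum of squared errors over 2m
gradient = (X.T @ error) / m      # derivative of that cost with respect to theta

print(cost, gradient)
```

Each line corresponds to one step in the text, so the whole computation needs no explicit loop over training examples.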

## Python implementation of Gradient Descent using Linear Regression
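What follows is a minimal sketch of batch gradient descent for linear regression, not the author's original code. The function name `gradient_descent`, the default hyperparameters, the toy data and the 1/(2m) cost scaling are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, epochs=1000):
    """Fit linear regression by batch gradient descent.

    Returns the learned parameters and the cost recorded at each epoch.
    """
    m = len(y)
    X = np.hstack([np.ones((m, 1)), x])  # prepend ones for the intercept
    theta = np.zeros(X.shape[1])         # initialize the parameters
    cost_history = []
    for _ in range(epochs):
        error = X @ theta - y                                 # hypothesis minus target
        cost_history.append((error @ error) / (2 * m))        # assumed 1/(2m) cost
        theta = theta - learning_rate * (X.T @ error) / m     # update rule
    return theta, cost_history

# Hypothetical toy data generated from y = 1 + 2x
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta, cost_history = gradient_descent(x, y)
print(theta)
```

On this noise-free data the learned parameters should approach an intercept of 1 and a slope of 2, and the recorded cost should fall from one epoch to the next. With real data, the learning rate and number of epochs usually need tuning, and scaling the inputs often helps convergence.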

Written by:

Aayushmaan Jain