Gradient descent step by step

In this blog, we’ll explore how to use gradient descent to fit a line to three data points. Instead of diving into the theoretical aspects, we’ll take a hands-on approach using the simple slope-intercept form of a line. We’ll start with an initial slope and intercept of zero, then calculate the predicted value of y using our hypothesis function. Based on the errors, we’ll adjust the slope and intercept for the next iteration. We’ll repeat this process until the errors decrease significantly.

Let’s cover some basic definitions here. The gradient of a function \( f \) is the vector of its partial derivatives with respect to its input variables. Mathematically, for a function \( f(x_1, x_2, \ldots, x_n) \), the gradient is:

\( \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \)
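
To make this concrete, here is a small example of my own (not from the original post): for \( f(x_1, x_2) = x_1^2 + 3x_2 \), the partial derivatives are \( \frac{\partial f}{\partial x_1} = 2x_1 \) and \( \frac{\partial f}{\partial x_2} = 3 \), so \( \nabla f = (2x_1, 3) \).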

 

Gradient in Machine Learning

  • In supervised learning, we often minimize a loss function \( J(\theta) \), where \( \theta \) represents the model parameters.
  • The gradient of the loss function with respect to \(\theta\) tells us how to update the model parameters to reduce the loss.
  • Gradient Descent is an algorithm that updates parameters in the direction of the negative gradient to minimize the function.

Let’s start with a basic loss function with just a slope \( \theta_1 \) and an intercept \( \theta_0 \), and compute the gradients step by step for three data points:
\( J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)

where:

  • \( m \) is the number of data points.
  • \( h_\theta(x) = \theta_0 + \theta_1 x \) is the hypothesis (predicted value).
  • \( y^{(i)} \) is the actual value.

Let’s assume we have three points:

\( (x_1, y_1) = (1, 2), \quad (x_2, y_2) = (2, 2.5), \quad (x_3, y_3) = (3, 3.5) \)

We want to fit a line to the above points using gradient descent.

The straight-line hypothesis is given by the expression:

\( h_\theta(x) = \theta_0 + \theta_1 x \)
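
Before working through the numbers by hand, here is a minimal Python sketch (the function names are my own, not from the original post) that encodes the three data points, the hypothesis, and the loss defined above, and evaluates the loss at the starting point \( \theta_0 = \theta_1 = 0 \):

```python
# The three data points used throughout this post
x = [1.0, 2.0, 3.0]
y = [2.0, 2.5, 3.5]
m = len(x)

def hypothesis(theta0, theta1, xi):
    # Straight-line hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * xi

def loss(theta0, theta1):
    # J(theta0, theta1) = (1 / 2m) * sum of squared errors
    return sum((hypothesis(theta0, theta1, xi) - yi) ** 2
               for xi, yi in zip(x, y)) / (2 * m)

print(loss(0.0, 0.0))  # 3.75 at theta0 = theta1 = 0
```

With both parameters at zero, every prediction is zero, so the loss starts at 3.75 and should shrink as gradient descent runs.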

Our approach will be to calculate the predictions with the initial values of the slope \(\theta_1\) and the intercept \(\theta_0\), which in this case will both be zero. Once we have the predictions, we will find out by how much they are off, i.e., calculate the errors.


| Data Point \( i \) | \( x^{(i)} \) (Input) | \( y^{(i)} \) (Actual Output) | \( h_\theta(x^{(i)}) \) (Prediction) | Error \( h_\theta(x^{(i)}) - y^{(i)} \) |
| --- | --- | --- | --- | --- |
| 1 | 1 | 2 | \( 0 + 0(1) = 0 \) | \( 0 - 2 = -2 \) |
| 2 | 2 | 2.5 | \( 0 + 0(2) = 0 \) | \( 0 - 2.5 = -2.5 \) |
| 3 | 3 | 3.5 | \( 0 + 0(3) = 0 \) | \( 0 - 3.5 = -3.5 \) |
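
The prediction and error columns above can be reproduced with a few lines of Python (a sketch using the same three points):

```python
x = [1.0, 2.0, 3.0]
y = [2.0, 2.5, 3.5]
theta0, theta1 = 0.0, 0.0  # initial slope and intercept

for i, (xi, yi) in enumerate(zip(x, y), start=1):
    prediction = theta0 + theta1 * xi  # h_theta(x^(i))
    error = prediction - yi            # h_theta(x^(i)) - y^(i)
    print(f"point {i}: prediction = {prediction}, error = {error}")
# point 1: prediction = 0.0, error = -2.0
# point 2: prediction = 0.0, error = -2.5
# point 3: prediction = 0.0, error = -3.5
```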

Next, we calculate the gradients of the loss function with respect to the slope and the intercept. The loss function is given as:

\( J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)

Change in the loss function w.r.t. the intercept \( \theta_0 \):

\( \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \)
\( = \frac{1}{3} \left( (-2) + (-2.5) + (-3.5) \right) \)
\( = \frac{1}{3} (-8) \approx -2.67 \)

Change in the loss function w.r.t. the slope \( \theta_1 \):

\( \frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \)
\( = \frac{1}{3} \left( (-2 \times 1) + (-2.5 \times 2) + (-3.5 \times 3) \right) \)
\( = \frac{1}{3} (-17.5) \approx -5.83 \)
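
Both partial derivatives can be checked with a short sketch (reusing the errors from the table above):

```python
x = [1.0, 2.0, 3.0]
errors = [-2.0, -2.5, -3.5]  # h_theta(x^(i)) - y^(i) at theta0 = theta1 = 0
m = len(x)

grad_theta0 = sum(errors) / m                              # dJ/dtheta0
grad_theta1 = sum(e * xi for e, xi in zip(errors, x)) / m  # dJ/dtheta1

print(round(grad_theta0, 2), round(grad_theta1, 2))  # -2.67 -5.83
```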

What do negative values of the derivative indicate?

The gradient (partial derivative) of the loss function w.r.t. \( \theta_0 \) is \( \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \). This means:

  • If \( h_\theta(x) - y \) (the error) is negative on average, the derivative is negative.
  • If \( h_\theta(x) - y \) (the error) is positive on average, the derivative is positive.

What happens when the gradient is negative, i.e., \( \frac{\partial J}{\partial \theta_0} < 0 \)?

  • This means that the model is underestimating the actual values \( y \).
  • The predicted values \( h_\theta(x) \) are too low compared to \( y \).
  • To minimize loss, we increase \( \theta_0\) in the next gradient descent step.

Using the gradient descent update rule

\( \theta_j := \theta_j - \alpha \cdot \frac{\partial J}{\partial \theta_j} \)

Let’s use a learning rate of \( \alpha = 0.01 \):

 

\(\text{Update } \theta_0:\) \(\theta_0 := 0 - (0.01 \times -2.67) = 0.0267\)
\(\text{Update } \theta_1:\) \(\theta_1 := 0 - (0.01 \times -5.83) = 0.0583\)
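
In code, a single gradient descent step amounts to the following (a sketch using the gradients computed above):

```python
alpha = 0.01                             # learning rate
theta0, theta1 = 0.0, 0.0                # current parameters
grad_theta0, grad_theta1 = -2.67, -5.83  # gradients from the step above

theta0 -= alpha * grad_theta0  # theta0 becomes ~0.0267
theta1 -= alpha * grad_theta1  # theta1 becomes ~0.0583
print(round(theta0, 4), round(theta1, 4))  # 0.0267 0.0583
```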

Using the updated \(\theta_0\) and \(\theta_1\), we can continue calculating the derivatives until the loss decreases to an acceptable value.

| Iteration | Learning Rate (\( \alpha \)) | \( \theta_0 \) (Intercept) | \( \theta_1 \) (Slope) | \( h_\theta(x) \) (Predictions) | Error \( h_\theta(x) - y \) | Gradient \( \frac{\partial J}{\partial \theta_0} \) | Gradient \( \frac{\partial J}{\partial \theta_1} \) | Update for \( \theta_0 \) | Update for \( \theta_1 \) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.01 | 0.0267 | 0.0583 | [0.085, 0.143, 0.201] | [-1.915, -2.357, -3.299] | -2.67 | -5.83 | \( 0.0267 - (0.01 \times -2.67) \) | \( 0.0583 - (0.01 \times -5.83) \) |
| 2 | 0.01 | 0.0526 | 0.1148 | [0.171, 0.286, 0.400] | [-1.829, -2.214, -3.100] | -2.59 | -5.65 | \( 0.0526 - (0.01 \times -2.59) \) | \( 0.1148 - (0.01 \times -5.65) \) |
| 3 | 0.01 | 0.0779 | 0.1696 | [0.258, 0.429, 0.600] | [-1.742, -2.071, -2.900] | -2.53 | -5.48 | \( 0.0779 - (0.01 \times -2.53) \) | \( 0.1696 - (0.01 \times -5.48) \) |
| 4 | 0.01 | 0.1028 | 0.2230 | [0.343, 0.572, 0.800] | [-1.657, -1.928, -2.700] | -2.47 | -5.31 | \( 0.1028 - (0.01 \times -2.47) \) | \( 0.2230 - (0.01 \times -5.31) \) |
| 5 | 0.01 | 0.1272 | 0.2751 | [0.427, 0.714, 1.000] | [-1.573, -1.786, -2.500] | -2.41 | -5.14 | \( 0.1272 - (0.01 \times -2.41) \) | \( 0.2751 - (0.01 \times -5.14) \) |
| 6 | 0.01 | 0.1512 | 0.3260 | [0.510, 0.857, 1.200] | [-1.490, -1.643, -2.300] | -2.35 | -4.98 | \( 0.1512 - (0.01 \times -2.35) \) | \( 0.3260 - (0.01 \times -4.98) \) |
| 7 | 0.01 | 0.1747 | 0.3756 | [0.592, 1.000, 1.400] | [-1.408, -1.500, -2.100] | -2.29 | -4.83 | \( 0.1747 - (0.01 \times -2.29) \) | \( 0.3756 - (0.01 \times -4.83) \) |
| … | … | … | … | … | … | … | … | … | … |
| 20 | 0.01 | 0.4782 | 1.0223 | [1.478, 2.500, 3.500] | [-0.522, -0.000, 0.000] | -1.50 | -2.80 | \( 0.4782 - (0.01 \times -1.50) \) | \( 1.0223 - (0.01 \times -2.80) \) |
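
Putting all the steps together, here is a minimal sketch of the full loop (the variable names are my own; because of rounding, the numbers it prints may differ slightly from the hand-computed table above):

```python
x = [1.0, 2.0, 3.0]
y = [2.0, 2.5, 3.5]
m = len(x)
alpha = 0.01               # learning rate
theta0, theta1 = 0.0, 0.0  # start with slope and intercept at zero

for iteration in range(1, 21):
    predictions = [theta0 + theta1 * xi for xi in x]     # h_theta(x^(i))
    errors = [p - yi for p, yi in zip(predictions, y)]   # h_theta(x^(i)) - y^(i)
    grad0 = sum(errors) / m                              # dJ/dtheta0
    grad1 = sum(e * xi for e, xi in zip(errors, x)) / m  # dJ/dtheta1
    theta0 -= alpha * grad0                              # theta := theta - alpha * gradient
    theta1 -= alpha * grad1
    print(iteration, round(theta0, 4), round(theta1, 4))
```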

Below you can see how the slope and intercept are gradually updated in every iteration until the final line aligns with the three data points.
