The cost function is a mathematical measure of how well a machine learning model’s predictions match the actual target values. It evaluates the overall error between predicted outputs and true outputs across the entire dataset.
For example, in linear regression, the cost function is typically the Mean Squared Error (MSE):
\( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)
Where:
- \( m \): Total number of training examples.
- \( h_\theta(x^{(i)}) \): Predicted value for the \( i \)-th example.
- \( y^{(i)} \): Actual value for the \( i \)-th example.
The goal of training is to minimize \( J(\theta) \) by adjusting \( \theta \), ensuring the model makes accurate predictions.
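To make the formula concrete, here is a minimal Python sketch of this cost; the function name `compute_cost` and its arguments are illustrative choices of my own, not notation from the text.

```python
def compute_cost(predictions, targets):
    """Average squared error: J = 1/(2m) * sum((y_hat - y)^2)."""
    m = len(targets)                      # total number of training examples
    total = 0.0
    for y_hat, y in zip(predictions, targets):
        total += (y_hat - y) ** 2         # squared error for one example
    return total / (2 * m)

# Example: predictions [2.5, 0.0] against targets [3.0, 1.0]
# gives (0.25 + 1.0) / (2 * 2) = 0.3125
print(compute_cost([2.5, 0.0], [3.0, 1.0]))
```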
Why divide by 2m?
Dividing by \( m \) ensures that the cost function represents the average squared error, rather than the total squared error. This normalization keeps the cost function independent of the dataset size.
The extra factor of 2 is there for convenience: when we differentiate the squared residual, the resulting factor of 2 cancels with the \( \frac{1}{2} \), making the gradient simpler and cleaner (more on this later).
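To make the cancellation concrete, here is the derivative of a single squared-error term (assuming the linear hypothesis \( h_\theta(x) = \theta^T x \), so that \( \frac{\partial h_\theta(x)}{\partial \theta_j} = x_j \)):

\( \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \big( h_\theta(x) - y \big)^2 \right] = \frac{1}{2m} \cdot 2 \big( h_\theta(x) - y \big) \cdot \frac{\partial h_\theta(x)}{\partial \theta_j} = \frac{1}{m} \big( h_\theta(x) - y \big) x_j \)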
Vectorizing the hypothesis function
Instead of writing the hypothesis for individual examples, we represent all predictions as a vector:
\( h_\theta(X) = X\theta \)
Where:
- \( X \): design matrix whose rows are the training examples.
- \( \theta \): \( (n + 1) \times 1 \) vector of parameters.
Square of residuals
The squared residuals are the element-wise square of \( X\theta - y \).
In matrix notation, we express their sum as the dot product:
\( (X\theta - y)^T (X\theta - y) \)
so the cost becomes
\( J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) \)
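As a quick sanity check, here is a NumPy sketch (the design matrix, parameters, and targets are illustrative values, not taken from the text) showing that the sum of element-wise squared residuals equals the dot-product form:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # illustrative design matrix
theta = np.array([0.5, 0.5])                         # illustrative parameters
y = np.array([2.0, 2.8, 3.6])                        # illustrative targets

r = X @ theta - y                 # residuals, one per example
sum_elementwise = (r ** 2).sum()  # sum of element-wise squared residuals
dot_product = r @ r               # (X*theta - y)^T (X*theta - y)
print(np.isclose(sum_elementwise, dot_product))  # True
```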
How is the above product equal to the sum of squared residuals?
Consider two vectors \( a \) and \( b \):
- \( a \) and \( b \) are column vectors of the same size.
- Example: \( a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \)
Difference \( a - b \):
- Subtract corresponding elements of \( a \) and \( b \): \( a - b = \begin{bmatrix} a_1 - b_1 \\ a_2 - b_2 \\ a_3 - b_3 \end{bmatrix} \)
Transpose of \( a - b \):
- Transpose flips \( a - b \) from a column to a row vector: \( (a - b)^T = \begin{bmatrix} a_1 - b_1 & a_2 - b_2 & a_3 - b_3 \end{bmatrix} \)
Dot Product with Itself:
- Multiply \( (a - b)^T \) with \( (a - b) \): \( (a - b)^T (a - b) = (a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 \)
Meaning
- This operation calculates the sum of the squared differences between corresponding elements of \( a \) and \( b \).
- It represents the squared Euclidean distance between \( a \) and \( b \), as checked numerically below.
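A small NumPy check of this equivalence, with arbitrary example values for \( a \) and \( b \):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

d = a - b
print(d @ d)                       # (a - b)^T (a - b)            -> 50.0
print(((a - b) ** 2).sum())        # sum of squared differences   -> 50.0
print(np.linalg.norm(a - b) ** 2)  # squared Euclidean distance   -> approx 50.0
```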
To simplify, we’ll first distribute the transpose operation to the terms inside the parentheses. The transpose of a difference is the difference of the transposes, so we get:
\( (X\theta - y)^T = (X\theta)^T - y^T \)
Since \( (X\theta)^T = \theta^T X^T \), we can rewrite the transpose as:
\( \theta^T X^T - y^T \)
Now, substitute this back into the original expression for \( J(\theta) \):
\( J(\theta) = \frac{1}{2m} \left( \theta^T X^T - y^T \right) \left( X\theta - y \right) \)
Now expand this product:
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)
Let’s focus on \( \theta^TX^Ty \) and \(y^TX\theta \)
Imagine:
- \( X \): a matrix of dimensions \( n \times m \)
- \( \theta \): a vector of dimensions \( m \times 1 \)
- \( y \): a vector of dimensions \( n \times 1 \)
\( X\theta =
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1m} \\
x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nm}
\end{bmatrix}
\begin{bmatrix}
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_m
\end{bmatrix} =
\begin{bmatrix}
\sum_{i=1}^m x_{1i}\theta_i \\
\sum_{i=1}^m x_{2i}\theta_i \\
\vdots \\
\sum_{i=1}^m x_{ni}\theta_i
\end{bmatrix} \)
\( y^T(X\theta) =
\begin{bmatrix}
y_1 & y_2 & \dots & y_n
\end{bmatrix}
\begin{bmatrix}
(X\theta)_1 \\
(X\theta)_2 \\
\vdots \\
(X\theta)_n
\end{bmatrix} =
\sum_{i=1}^n y_i \cdot (X\theta)_i \)
\( (X\theta)^T y =
\begin{bmatrix}
(X\theta)_1 & (X\theta)_2 & \dots & (X\theta)_n
\end{bmatrix}
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix} =
\sum_{i=1}^n (X\theta)_i \cdot y_i \)
Both \(y^T(X\theta) \) and \( (X\theta)^Ty \) result in the same scalar value because the dot product is symmetric.
How \( y^T(X\theta) \) and \( (X\theta)^T y \) result in the same scalar
Assume \( X \) is a \( 2 \times 3 \) matrix, \( \theta \) is a \( 3 \times 1 \) vector, and \( y \) is a \( 2 \times 1 \) vector:
\( X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad \theta = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad y = \begin{bmatrix} 7 \\ 8 \end{bmatrix} \)
Compute \( \theta^T X^T y \)
Compute \( X^T y \):
\( X^T y = \begin{bmatrix} 39 \\ 54 \\ 69 \end{bmatrix} \)
Compute \( \theta^T X^T y \):
\( \theta^T X^T y = \newline (1)(39) + (1)(54) + (1)(69) = 162 \)
Compute \( y^T X \theta \)
Compute \( X \theta \):
\( X \theta = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \)
Compute \( y^T X \theta \):
\( y^T X \theta = (7)(6) + (8)(15) = 162 \)
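Reproducing the numbers above with NumPy (a small sketch; the array values are the ones from this example) confirms that both orderings give the same scalar:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
theta = np.array([1, 1, 1])
y = np.array([7, 8])

print(theta @ X.T @ y)   # theta^T X^T y = 162
print(y @ (X @ theta))   # y^T X theta   = 162, the same scalar
```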
Since both middle terms evaluate to the same scalar, the expanded expression
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)
can be simplified to
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - 2 ( X \theta) ^T y + y^T y \right] \)
Let’s represent \( J(\theta) \) as:
\( J(\theta) = \frac{1}{2m} \left[ P(\theta) - Q(\theta) + y^T y \right] \)
Where
\( P(\theta) = \theta^T X^T X \theta \quad \text{and} \quad Q(\theta) = 2 (X\theta)^T y \)
We can ignore the last term \( y^T y \) when differentiating, as it does not involve \( \theta \).
Let’s calculate the derivative of \( J(\theta) \) with respect to \( \theta \) by calculating \( \frac{\partial P(\theta)}{\partial \theta} \) and \( \frac{\partial Q(\theta)}{\partial \theta} \), and combining them as
\( \frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} - \frac{\partial Q(\theta)}{\partial \theta} \right] \)
Start with
\( Q(\theta) = 2 ( X \theta) ^T y \)
Using the worked example below (the derivative of \( X\theta \) with respect to \( \theta \)), we can calculate
\( \frac{\partial Q(\theta)}{\partial \theta} = 2 X^T y \)
Imagine a 2×2 matrix for \( X \) and a 2×1 vector for \( \theta \):
\( X = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}, \quad \theta = \begin{bmatrix}
\theta_1 \\
\theta_2
\end{bmatrix} \)
\( X\theta = \begin{bmatrix}
1\cdot\theta_1 + 2\cdot\theta_2 \\
3\cdot\theta_1 + 4\cdot\theta_2
\end{bmatrix} = \begin{bmatrix}
\theta_1 + 2\theta_2 \\
3\theta_1 + 4\theta_2
\end{bmatrix} \)
The derivative of \(X\theta \) with respect to \( \theta \) measures how \( X\theta \) changes as each component of \( \theta \) changes.
\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
\frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_1} & \frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_2} \\
\frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_1} & \frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_2}
\end{bmatrix} \)
\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix} = X \)
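A quick finite-difference check of this Jacobian, using the same \( 2 \times 2 \) matrix (the test point for \( \theta \) is an arbitrary choice of mine; the Jacobian of a linear map is the same everywhere):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
theta = np.array([0.7, -1.3])   # arbitrary point
eps = 1e-6

# Numerically estimate column j of d(X @ theta)/d(theta) by perturbing theta_j
jac = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    jac[:, j] = (X @ (theta + step) - X @ (theta - step)) / (2 * eps)

print(np.allclose(jac, X))  # True: the Jacobian is X itself
```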
To calculate \( \frac{\partial P(\theta)}{\partial \theta} \), where \( P(\theta) = \theta^T X^T X \theta \),
we need to use the derivative of a quadratic form:
\( \frac{\partial}{\partial \theta} \big( \theta^T A \theta \big) = (A + A^T)\theta \), which equals \( 2A\theta \) when \( A \) is symmetric.
(Expand the proof of the quadratic form below, if it doesn’t make sense.)
Since \( X^T X \) is symmetric, using the quadratic form above we can deduce:
\( \frac{\partial P(\theta)}{\partial \theta} = 2(X^\top X)\theta \)
Derivative of the quadratic form
We expand \( f(\theta) = \theta^T A \theta \) using matrix-vector multiplication, with a concrete \( 2 \times 2 \) example:
\( A = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \)
Performing the multiplication step-by-step:
\( A\theta = \begin{bmatrix} 2\theta_1 + 3\theta_2 \\ 4\theta_1 + 5\theta_2 \end{bmatrix} \)
Now multiply \( \theta^T \) with \( A \theta \):
\( \theta^T (A\theta) = \theta_1 (2\theta_1 + 3\theta_2) + \theta_2 (4\theta_1 + 5\theta_2) \)
Expanding:
\( f(\theta) = 2\theta_1^2 + 3\theta_1\theta_2 + 4\theta_2\theta_1 + 5\theta_2^2 \)
The derivative \( \frac{\partial f(\theta)}{\partial \theta} \) is the vector of partial derivatives with respect to \( \theta_1 \) and \( \theta_2 \). From the expanded form:
Partial derivative with respect to \( \theta_1 \), term by term:
- \( \frac{\partial}{\partial \theta_1}(2\theta_1^2) = 4\theta_1 \)
- \( \frac{\partial}{\partial \theta_1}(3\theta_1\theta_2) = 3\theta_2 \)
- \( \frac{\partial}{\partial \theta_1}(4\theta_2\theta_1) = 4\theta_2 \)
- \( \frac{\partial}{\partial \theta_1}(5\theta_2^2) = 0 \)
Combine:
\( \frac{\partial f}{\partial \theta_1} = 4\theta_1 + 3\theta_2 + 4\theta_2 = 4\theta_1 + 7\theta_2 \)
Partial derivative with respect to \( \theta_2 \), term by term:
- \( \frac{\partial}{\partial \theta_2}(2\theta_1^2) = 0 \)
- \( \frac{\partial}{\partial \theta_2}(3\theta_1\theta_2) = 3\theta_1 \)
- \( \frac{\partial}{\partial \theta_2}(4\theta_2\theta_1) = 4\theta_1 \)
- \( \frac{\partial}{\partial \theta_2}(5\theta_2^2) = 10\theta_2 \)
Combine:
\( \frac{\partial f}{\partial \theta_2} = 3\theta_1 + 4\theta_1 + 10\theta_2 = 7\theta_1 + 10\theta_2 \)
The gradient is:
\( \frac{\partial f(\theta)}{\partial \theta} = \begin{bmatrix} 4\theta_1 + 7\theta_2 \\ 7\theta_1 + 10\theta_2 \end{bmatrix} \)
Rewriting using \( A \):
\( \frac{\partial f(\theta)}{\partial \theta} = \begin{bmatrix} 4 & 7 \\ 7 & 10 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = (A + A^T)\theta \)
Thus:
\( \frac{\partial}{\partial \theta} \big( \theta^T A \theta \big) = (A + A^T)\theta \), which reduces to \( 2A\theta \) when \( A \) is symmetric (as \( X^T X \) is).
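A small numeric check of this gradient with the non-symmetric \( A \) used above (the test point is an arbitrary value of mine):

```python
import numpy as np

A = np.array([[2.0, 3.0], [4.0, 5.0]])
theta = np.array([1.5, -0.5])   # arbitrary test point
eps = 1e-6

f = lambda t: t @ A @ t          # the quadratic form f(theta) = theta^T A theta
grad_fd = np.array([
    (f(theta + np.array([eps, 0.0])) - f(theta - np.array([eps, 0.0]))) / (2 * eps),
    (f(theta + np.array([0.0, eps])) - f(theta - np.array([0.0, eps]))) / (2 * eps),
])

print(np.allclose(grad_fd, (A + A.T) @ theta))  # True for any A
print(np.allclose(grad_fd, 2 * A @ theta))      # False here, since this A is not symmetric
```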
Since
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} - \frac{\partial Q(\theta)}{\partial \theta} \right]
\)
Substituting these values, we get
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ 2(X^\top X)\theta - 2X^\top y \right]
\)
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T (X\theta – y)
\)
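To double-check this final gradient expression, here is a finite-difference comparison on made-up data (the design matrix, targets, and test point are illustrative values):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # illustrative design matrix
y = np.array([2.0, 2.8, 3.6])                         # illustrative targets
theta = np.array([0.4, 0.9])                          # arbitrary test point
m = len(y)

J = lambda t: ((X @ t - y) @ (X @ t - y)) / (2 * m)   # the cost defined above
analytic = X.T @ (X @ theta - y) / m                  # gradient derived above

eps = 1e-6
numeric = np.array([
    (J(theta + np.array([eps, 0.0])) - J(theta - np.array([eps, 0.0]))) / (2 * eps),
    (J(theta + np.array([0.0, eps])) - J(theta - np.array([0.0, eps]))) / (2 * eps),
])
print(np.allclose(analytic, numeric))  # True
```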
Setting the derivative equal to zero:
\( \frac{1}{m} X^T (X\theta - y) = 0 \quad \Rightarrow \quad X^T X \theta = X^T y \)
Solving for \( \theta \):
\( \theta = (X^T X)^{-1} X^T y \)
This is the closed-form solution (the normal equation) for \( \theta \). It can be used to find the line that best fits linearly related data points.
Let’s use the closed form to find a line that fits the x and y values below:

| x | y |
|---|---|
| 1 | 2 |
| 2 | 2.8 |
| 3 | 3.6 |
The design matrix \(X\) and output vector \(y\) are:
\( X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 2.8 \\ 3.6 \end{bmatrix} \)
\( X^T X = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \)
\( X^T y = \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} \)
\( (X^T X)^{-1} = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \)
\( \theta = (X^T X)^{-1} X^T y = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} = \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix} \)
The best-fit line equation is:
\( y = 1.2 + 0.8x \)
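The same result can be reproduced in a few lines of NumPy (a sketch; in practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over explicitly inverting \( X^T X \)):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # design matrix with bias column
y = np.array([2.0, 2.8, 3.6])

theta = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equation: (X^T X)^{-1} X^T y
print(theta)                               # approx [1.2 0.8]  ->  y = 1.2 + 0.8x
```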