Matrix form of the closed-form solution for the hypothesis function

The cost function \( J(\theta) \) is a mathematical measure of how well a machine learning model’s predictions match the actual target values. It evaluates the overall error between the predicted outputs \( h_\theta(x) \) and the true outputs \( y \) across the entire dataset.

For example, in linear regression, the cost function is typically the Mean Squared Error (MSE):

\( J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)

Where:

  • \( m \): Total number of training examples.
  • \( h_\theta(x^{(i)}) \): Predicted value for the \( i \)-th example.
  • \( y^{(i)} \): Actual value for the \( i \)-th example.

The goal of training is to minimize \( J(\theta) \) by adjusting \( \theta \), ensuring the model makes accurate predictions.

\( J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)
Why divide by 2m?

Dividing by \( m \) ensures that the cost function represents the average squared error, rather than the total squared error. This normalization helps keep the cost function independent of the dataset size.

When we differentiate, the squared residual produces a factor of 2 that cancels with the \( \frac{1}{2} \), making the gradient simpler and cleaner (more on this later).
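
As a quick illustration, here is a minimal NumPy sketch of this cost function, written as an explicit sum over examples. The function name `compute_cost` and the toy data are made up for illustration:

```python
import numpy as np

def compute_cost(X, theta, y):
    """J(theta) = 1/(2m) * sum over i of (h_theta(x_i) - y_i)^2."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):        # loop over the m training examples
        prediction = x_i @ theta      # h_theta(x_i) = theta^T x_i
        total += (prediction - y_i) ** 2
    return total / (2 * m)

# Toy data: three points that lie exactly on the line y = 1.2 + 0.8x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is the bias term
y = np.array([2.0, 2.8, 3.6])
print(compute_cost(X, np.array([1.2, 0.8]), y))      # 0.0 for the exact fit
print(compute_cost(X, np.array([0.0, 1.0]), y))      # > 0 for a worse guess
```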

Vectorizing the hypothesis function

Instead of writing the hypothesis for individual examples, we represent all predictions as a vector:

\( h_\theta(X) = X\theta \)

Where:

  • \( X \): the \( m \times (n + 1) \) matrix of all training examples (including a column of ones for the bias term).
  • \( \theta \): the \( (n + 1) \times 1 \) vector of parameters.
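
A small sketch of the vectorized hypothesis, assuming the same made-up data as above; all \( m \) predictions come out of a single matrix-vector product:

```python
import numpy as np

# Three examples, one feature each; the column of ones is the bias term.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])           # shape (m, n + 1) with m = 3, n = 1
theta = np.array([1.2, 0.8])         # shape (n + 1,)

h = X @ theta                        # h_theta(X) = X theta, all predictions at once
print(h)                             # [2.  2.8 3.6]
```
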
Square of residuals

The squared residuals are the element-wise square of the residual vector

\( X\theta - y \)

In matrix notation, their sum is the dot product of the residual vector with itself:

Sum of squared residuals \( = (X\theta - y)^T (X\theta - y) \)

 

  1. Consider two vectors \( A \) and \( B \):

    • \( A \) and \( B \) are column vectors of the same size.
    • Example: \( A = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \quad B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \)
  2. Difference \( A - B \):

    • Subtract corresponding elements of \( A \) and \( B \): \( A - B = \begin{bmatrix} a_1 - b_1 \\ a_2 - b_2 \\ a_3 - b_3 \end{bmatrix} \)
  3. Transpose of \( A - B \):

    • The transpose flips \( A - B \) from a column to a row vector: \( (A - B)^T = \begin{bmatrix} a_1 - b_1 & a_2 - b_2 & a_3 - b_3 \end{bmatrix} \)
  4. Dot product with itself:

    • Multiply \( (A - B)^T \) with \( A - B \): \( (A - B)^T (A - B) = (a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 \)

Meaning

  • This operation calculates the sum of the squared differences between corresponding elements of \( A \) and \( B \).
  • It represents the squared Euclidean distance between \( A \) and \( B \).
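
A quick NumPy check of this identity; the vectors here are arbitrary, made-up values:

```python
import numpy as np

A = np.array([1.0, 4.0, 2.0])
B = np.array([3.0, 1.0, 5.0])

d = A - B                              # element-wise differences
print(d @ d)                           # (A - B)^T (A - B) as a dot product -> 22.0
print(np.sum((A - B) ** 2))            # sum of squared differences        -> 22.0
print(np.linalg.norm(A - B) ** 2)      # squared Euclidean distance        -> 22.0 (up to rounding)
```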

To simplify, we’ll first distribute the transpose operation to the terms inside the parentheses. The transpose of a difference is the difference of the transposes, so we get:

\( (X\theta - y)^T = (X\theta)^T - y^T \)

Since

\( (X\theta)^T = \theta^T X^T \),

we can rewrite the transpose as:

\( (X\theta - y)^T = \theta^T X^T - y^T \)

 

Now, substitute this back into the original expression for \( J(\theta) \):

\( J(\theta) = \frac{1}{2m} (\theta^T X^T - y^T)(X\theta - y) \)

Now expand this product:

\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)

 

Let’s focus on \( \theta^TX^Ty \) and \(y^TX\theta \) 

Imagine (using generic dimensions for this aside):

  • \( X \): a matrix of dimensions \( n \times m \)
  • \( \theta \): a vector of dimensions \( m \times 1 \)
  • \( y \): a vector of dimensions \( n \times 1 \)

\( X\theta =
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1m} \\
x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nm}
\end{bmatrix}
\begin{bmatrix}
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_m
\end{bmatrix} =
\begin{bmatrix}
\sum_{i=1}^m x_{1i}\theta_i \\
\sum_{i=1}^m x_{2i}\theta_i \\
\vdots \\
\sum_{i=1}^m x_{ni}\theta_i
\end{bmatrix} \)

\( y^T(X\theta) =
\begin{bmatrix}
y_1 & y_2 & \dots & y_n
\end{bmatrix}
\begin{bmatrix}
(X\theta)_1 \\
(X\theta)_2 \\
\vdots \\
(X\theta)_n
\end{bmatrix} =
\sum_{i=1}^n y_i \cdot (X\theta)_i \)

\( (X\theta)^T y =
\begin{bmatrix}
(X\theta)_1 & (X\theta)_2 & \dots & (X\theta)_n
\end{bmatrix}
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix} =
\sum_{i=1}^n (X\theta)_i \cdot y_i \)

Both \(y^T(X\theta) \) and \( (X\theta)^Ty \) result in the same scalar value because the dot product is symmetric.

Assume \( X \) is a \( 2 \times 3 \) matrix, \( \theta \) is a \( 3 \times 1 \) vector, and \( y \) is a \( 2 \times 1 \) vector:

\( X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad \theta = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad y = \begin{bmatrix} 7 \\ 8 \end{bmatrix} \)

Compute \( \theta^T X^T y \)

Compute \( X^T y \):

\( X^T y = \begin{bmatrix} 39 \\ 54 \\ 69 \end{bmatrix} \)

Compute \( \theta^T X^T y \):

\( \theta^T X^T y = (1)(39) + (1)(54) + (1)(69) = 162 \)

Compute \( y^T X \theta \)

Compute \( X \theta \):

\( X \theta = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \)

Compute \( y^T X \theta \):

\( y^T X \theta = (7)(6) + (8)(15) = 162 \)

Since \( \theta^T X^T y = y^T X \theta \), both terms evaluate to 162, showing that they are equal and can be combined in the expression for \( J(\theta) \).
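
A quick NumPy check of the same arithmetic, also confirming that the expanded bracket matches the compact \( (X\theta - y)^T(X\theta - y) \):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
theta = np.array([1.0, 1.0, 1.0])
y = np.array([7.0, 8.0])

print(theta @ X.T @ y)   # theta^T X^T y -> 162.0
print(y @ X @ theta)     # y^T X theta   -> 162.0

r = X @ theta - y
compact = r @ r
expanded = theta @ X.T @ X @ theta - 2 * (X @ theta) @ y + y @ y
print(compact, expanded)  # both 50.0
```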

So, since the two middle terms are equal, the expression

\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)

can be simplified to

\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - 2 ( X \theta)^T y + y^T y \right] \)

Let’s represent \( J(\theta) \) as:

\( J(\theta) = \frac{1}{2m} \left[\color{red}{P(\theta)} + \color{green}{Q(\theta)} + \color{blue}{y^Ty} \right] \)


Where

\( \color{red}{P(\theta) = \theta^T X^T X \theta} \) and \( \color{green}{Q(\theta) = -2 ( X \theta)^T y} \)

We can ignore the last term \( y^T y \) when differentiating, as it does not involve \( \theta \) and its derivative is zero.

Let’s calculate the derivative of \( J(\theta) \) with respect to \( \theta \) by calculating \( \frac{\partial P(\theta)}{\partial \theta} \) and \( \frac{\partial Q(\theta)}{\partial \theta} \), since

\( \frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} + \frac{\partial Q(\theta)}{\partial \theta} \right] \)

Start with

\( Q(\theta) = -2 ( X \theta)^T y = -2\, y^T X \theta \)

Using the example below (which shows that \( \frac{\partial (X\theta)}{\partial \theta} = X \)), we can calculate

\( \frac{\partial Q(\theta)}{\partial \theta} = -2 X^T y \)

Imagine a 2×2 matrix for \( X \) and a 2×1 vector for \( \theta \):
\( X = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}, \quad \theta = \begin{bmatrix}
\theta_1 \\
\theta_2
\end{bmatrix} \)

\( X\theta = \begin{bmatrix}
1\cdot\theta_1 + 2\cdot\theta_2 \\
3\cdot\theta_1 + 4\cdot\theta_2
\end{bmatrix} = \begin{bmatrix}
\theta_1 + 2\theta_2 \\
3\theta_1 + 4\theta_2
\end{bmatrix} \)

The derivative of \(X\theta \) with respect to \( \theta \) measures how \( X\theta \) changes as each component of \( \theta \) changes.

\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
\frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_1} & \frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_2} \\
\frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_1} & \frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_2}
\end{bmatrix} \)

\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix} = X \)
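
A numerical sanity check of this Jacobian, approximating each partial derivative with central finite differences; the step size `eps` is an arbitrary small value:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def f(theta):
    return X @ theta

theta0 = np.array([0.5, -1.0])   # any point: the Jacobian of a linear map is constant
eps = 1e-6                       # small step for the finite-difference approximation

# Build the Jacobian column by column by perturbing one component of theta at a time.
jacobian = np.column_stack([
    (f(theta0 + eps * e) - f(theta0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(jacobian)   # approximately [[1. 2.], [3. 4.]], i.e. X itself
```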

To calculate \( \frac{\partial P(\theta)}{\partial \theta} \), where \( P(\theta) = \theta^T X^T X \theta \), we need to use

\( \frac{\partial}{\partial \theta} \big( \theta^T A \theta \big) = (A + A^T)\theta \)

which reduces to \( 2A\theta \) when \( A \) is symmetric. (A step-by-step proof of this quadratic-form rule follows below, if it doesn’t make sense.)

Since \( X^T X \) is symmetric, we can deduce:

\( \frac{\partial P(\theta)}{\partial \theta} = 2(X^\top X)\theta \)

Step 1: Expand the quadratic form

(In this side proof, \( Q(\theta) \) denotes the generic quadratic form \( \theta^T A \theta \) with an example matrix \( A \), not the \( Q(\theta) \) defined earlier.)

We expand \( Q(\theta) = \theta^T A \theta \) using matrix-vector multiplication:

\[ A = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \]

Performing the multiplication step-by-step:

\[ A \theta = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} 2\theta_1 + 3\theta_2 \\ 4\theta_1 + 5\theta_2 \end{bmatrix} \]

Now multiply \( \theta^T \) with \( A \theta \):

\[ Q(\theta) = \begin{bmatrix} \theta_1 & \theta_2 \end{bmatrix} \begin{bmatrix} 2\theta_1 + 3\theta_2 \\ 4\theta_1 + 5\theta_2 \end{bmatrix} \]

Expanding:

\[ Q(\theta) = 2\theta_1^2 + 3\theta_1\theta_2 + 4\theta_2\theta_1 + 5\theta_2^2 \]
Step 2: Compute the Derivative

The derivative \( \frac{\partial Q(\theta)}{\partial \theta} \) is:

\[ \frac{\partial Q(\theta)}{\partial \theta} = \begin{bmatrix} \frac{\partial Q(\theta)}{\partial \theta_1} \\ \frac{\partial Q(\theta)}{\partial \theta_2} \end{bmatrix} \]
Step 2a: Derivative with respect to \( \theta_1 \)

From the expanded form:

\[ Q(\theta) = 2\theta_1^2 + 3\theta_1\theta_2 + 4\theta_2\theta_1 + 5\theta_2^2 \]

Term by term:

  • \( \frac{\partial}{\partial \theta_1}(2\theta_1^2) = 4\theta_1 \)
  • \( \frac{\partial}{\partial \theta_1}(3\theta_1\theta_2) = 3\theta_2 \)
  • \( \frac{\partial}{\partial \theta_1}(4\theta_2\theta_1) = 4\theta_2 \)
  • \( \frac{\partial}{\partial \theta_1}(5\theta_2^2) = 0 \)

Combine:

\[ \frac{\partial Q(\theta)}{\partial \theta_1} = 4\theta_1 + 3\theta_2 + 4\theta_2 = 4\theta_1 + 7\theta_2 \]
Step 2b: Derivative with respect to \( \theta_2 \)

Term by term:

  • \( \frac{\partial}{\partial \theta_2}(2\theta_1^2) = 0 \)
  • \( \frac{\partial}{\partial \theta_2}(3\theta_1\theta_2) = 3\theta_1 \)
  • \( \frac{\partial}{\partial \theta_2}(4\theta_2\theta_1) = 4\theta_1 \)
  • \( \frac{\partial}{\partial \theta_2}(5\theta_2^2) = 10\theta_2 \)

Combine:

\[ \frac{\partial Q(\theta)}{\partial \theta_2} = 3\theta_1 + 4\theta_1 + 10\theta_2 = 7\theta_1+ 10\theta_2 \]
Step 3: Combine Results

The gradient is:

\[ \frac{\partial Q(\theta)}{\partial \theta} = \begin{bmatrix} 4\theta_1 + 7\theta_2 \\ 7\theta_1 + 10\theta_2 \end{bmatrix} \]
Step 4: Matrix Form

Rewriting using \( A \):

\[ A + A^T = \begin{bmatrix} 4 & 7 \\ 7 & 10 \end{bmatrix}, \quad (A + A^T)\theta = \begin{bmatrix} 4\theta_1 + 7\theta_2 \\ 7\theta_1 + 10\theta_2 \end{bmatrix} \]

Thus:

\[ \frac{\partial Q(\theta)}{\partial \theta} = (A + A^T)\theta, \]

which equals \( 2A\theta \) whenever \( A \) is symmetric, as \( X^T X \) is.
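
A quick numerical check of this rule, comparing a finite-difference gradient against \( (A + A^T)\theta \) for the non-symmetric example \( A \) above, and \( 2A\theta \) for a symmetric one (the values of `theta` are made up):

```python
import numpy as np

def quad(theta, A):
    return theta @ A @ theta      # the scalar theta^T A theta

def num_grad(f, theta, eps=1e-6):
    # Central finite differences, perturbing one component of theta at a time.
    return np.array([
        (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
        for e in np.eye(len(theta))
    ])

theta = np.array([2.0, -1.0])

A = np.array([[2.0, 3.0], [4.0, 5.0]])        # non-symmetric example from above
print(num_grad(lambda t: quad(t, A), theta))  # ~ [1. 4.]
print((A + A.T) @ theta)                      #   [1. 4.]

S = np.array([[2.0, 3.0], [3.0, 5.0]])        # symmetric case: gradient is 2 S theta
print(num_grad(lambda t: quad(t, S), theta))  # ~ [2. 2.]
print(2 * S @ theta)                          #   [2. 2.]
```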

Since 

\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} + \frac{\partial Q(\theta)}{\partial \theta} \right]
\)

Substituting these results, we get

\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ 2(X^\top X)\theta - 2X^\top y \right]
\)

\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T (X\theta - y)
\)
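
A minimal sketch checking this gradient formula against a finite-difference gradient of \( J(\theta) \); all data and names here are made up for illustration:

```python
import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))              # J(theta)

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)      # (1/m) X^T (X theta - y)

def num_grad(f, theta, eps=1e-6):
    return np.array([
        (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
        for e in np.eye(len(theta))
    ])

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.8, 3.6])
theta = np.array([0.5, 0.5])

print(grad(theta, X, y))                         # analytic gradient
print(num_grad(lambda t: cost(t, X, y), theta))  # matches up to rounding
```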

Setting the derivative equal to zero:

\( X^T (X\theta - y) = 0 \)

Solving for \( \theta \):

\( X^T X\theta - X^T y = 0 \)
\( X^T X\theta = X^T y \)

Assuming \( X^T X \) is invertible, multiply both sides by \( (X^T X)^{-1} \):

\( \theta = (X^T X)^{-1} X^T y \)

This is the closed-form solution (also known as the normal equation) for the parameters of the hypothesis function. It can be used to fit a line to linearly related data points.
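
A minimal NumPy sketch of the closed-form solution. It uses `np.linalg.solve` on the normal equations rather than forming the inverse explicitly, which is the usual numerically safer choice; the function name and the random data are just for illustration:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: made-up, roughly linear data y = 1.5 + 2.0 x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 2.0 * x + rng.normal(scale=0.1, size=50)

X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
print(normal_equation(X, y))                # approximately [1.5, 2.0]
```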

Let’s use the closed form to find a line that fits the x and y values below:

  x    y
  1    2
  2    2.8
  3    3.6

The design matrix \(X\) and output vector \(y\) are:

\( X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 2.8 \\ 3.6 \end{bmatrix} \)

Step 1: Compute \(X^T X\)

\( X^T X = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \)

Step 2: Compute \(X^T y\)

\( X^T y = \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} \)

Step 3: Compute \((X^T X)^{-1}\)

\( (X^T X)^{-1} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \)

Step 4: Compute \(\theta\)

\( \theta = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \times \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} \)

\( \theta = \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix} \)

Final Result

The best-fit line equation is:

\( y = 1.2 + 0.8x \)
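
The same computation in NumPy confirms the result; as a quick cross-check it also uses `np.linalg.lstsq`, which solves the same least-squares problem:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 2.8, 3.6])

theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y       # the normal equation, as derived
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least-squares solver
print(theta_closed)   # [1.2 0.8]
print(theta_lstsq)    # [1.2 0.8]
```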
