The cost function is a mathematical measure of how well a machine learning model’s predictions match the actual target values. It evaluates the overall error between predicted outputs and true outputs across the entire dataset.
For example, in linear regression, the cost function is typically the Mean Squared Error (MSE):
\( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \)
Where:
- \( m \): Total number of training examples.
- \( h_\theta(x^{(i)}) \): Predicted value for the \( i \)-th example.
- \( y^{(i)} \): Actual value for the \( i \)-th example.
The goal of training is to minimize \( J(\theta) \) by adjusting \( \theta \), ensuring the model makes accurate predictions.
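To make the formula concrete, here is a minimal Python sketch of this cost; the function name `compute_cost` and its arguments are illustrative choices of my own, not notation from the text.

```python
def compute_cost(predictions, targets):
    """Average squared error: J = 1/(2m) * sum((y_hat - y)^2)."""
    m = len(targets)                      # total number of training examples
    total = 0.0
    for y_hat, y in zip(predictions, targets):
        total += (y_hat - y) ** 2         # squared error for one example
    return total / (2 * m)

# Example: predictions [2.5, 0.0] against targets [3.0, 1.0]
# gives (0.25 + 1.0) / (2 * 2) = 0.3125
print(compute_cost([2.5, 0.0], [3.0, 1.0]))
```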
Why divide by 2m?
Dividing by \( m \) ensures that the cost function represents the average squared error, rather than the total squared error. This normalization keeps the cost function independent of the dataset size.
The extra factor of 2 is there for convenience: when we differentiate the squared residual, the resulting factor of 2 cancels with the \( \frac{1}{2} \), making the gradient simpler and cleaner (more on this later).
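To make the cancellation concrete, here is the derivative of a single squared-error term (assuming the linear hypothesis \( h_\theta(x) = \theta^T x \), so that \( \frac{\partial h_\theta(x)}{\partial \theta_j} = x_j \)):

\( \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \big( h_\theta(x) - y \big)^2 \right] = \frac{1}{2m} \cdot 2 \big( h_\theta(x) - y \big) \cdot \frac{\partial h_\theta(x)}{\partial \theta_j} = \frac{1}{m} \big( h_\theta(x) - y \big) x_j \)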
Vectorizing the hypothesis function
Instead of writing the hypothesis for individual examples, we represent all predictions as a vector:
\( h_\theta(X) = X\theta \)
Where:
- \( X \): design matrix whose rows are the training examples.
- \( \theta \): \( (n + 1) \times 1 \) vector of parameters.
Square of residuals
The squared residuals are the element-wise square of \( X\theta - y \).
In matrix notation, we express their sum as the dot product:
\( (X\theta - y)^T (X\theta - y) \)
so the cost becomes
\( J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) \)
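As a quick sanity check, here is a NumPy sketch (the design matrix, parameters, and targets are illustrative values, not taken from the text) showing that the sum of element-wise squared residuals equals the dot-product form:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # illustrative design matrix
theta = np.array([0.5, 0.5])                         # illustrative parameters
y = np.array([2.0, 2.8, 3.6])                        # illustrative targets

r = X @ theta - y                 # residuals, one per example
sum_elementwise = (r ** 2).sum()  # sum of element-wise squared residuals
dot_product = r @ r               # (X*theta - y)^T (X*theta - y)
print(np.isclose(sum_elementwise, dot_product))  # True
```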
How is the above product equal to the sum of squared residuals?
Consider two vectors \( a \) and \( b \):
- \( a \) and \( b \) are column vectors of the same size.
- Example: \( a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \)
Difference \( a - b \):
- Subtract corresponding elements of \( a \) and \( b \): \( a - b = \begin{bmatrix} a_1 - b_1 \\ a_2 - b_2 \\ a_3 - b_3 \end{bmatrix} \)
Transpose of \( a - b \):
- Transpose flips \( a - b \) from a column to a row vector: \( (a - b)^T = \begin{bmatrix} a_1 - b_1 & a_2 - b_2 & a_3 - b_3 \end{bmatrix} \)
Dot Product with Itself:
- Multiply \( (a - b)^T \) with \( (a - b) \): \( (a - b)^T (a - b) = (a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 \)
Meaning
- This operation calculates the sum of the squared differences between corresponding elements of \( a \) and \( b \).
- It represents the squared Euclidean distance between \( a \) and \( b \), as checked numerically below.
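A small NumPy check of this equivalence, with arbitrary example values for \( a \) and \( b \):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

d = a - b
print(d @ d)                       # (a - b)^T (a - b)            -> 50.0
print(((a - b) ** 2).sum())        # sum of squared differences   -> 50.0
print(np.linalg.norm(a - b) ** 2)  # squared Euclidean distance   -> approx 50.0
```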
To simplify, we’ll first distribute the transpose operation to the terms inside the parentheses. The transpose of a difference is the difference of the transposes, so we get:
\( (X\theta - y)^T = (X\theta)^T - y^T \)
Since \( (X\theta)^T = \theta^T X^T \), we can rewrite the transpose as:
\( \theta^T X^T - y^T \)
Now, substitute this back into the original expression for \( J(\theta) \):
\( J(\theta) = \frac{1}{2m} \left( \theta^T X^T - y^T \right) \left( X\theta - y \right) \)
Now expand this product:
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)
Let’s focus on \( \theta^TX^Ty \) and \(y^TX\theta \)
Imagine:
- \( X \): a matrix of dimensions \( n \times m \)
- \( \theta \): a vector of dimensions \( m \times 1 \)
- \( y \): a vector of dimensions \( n \times 1 \)
\( X\theta =
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1m} \\
x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nm}
\end{bmatrix}
\begin{bmatrix}
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_m
\end{bmatrix} =
\begin{bmatrix}
\sum_{i=1}^m x_{1i}\theta_i \\
\sum_{i=1}^m x_{2i}\theta_i \\
\vdots \\
\sum_{i=1}^m x_{ni}\theta_i
\end{bmatrix} \)
\( y^T(X\theta) =
\begin{bmatrix}
y_1 & y_2 & \dots & y_n
\end{bmatrix}
\begin{bmatrix}
(X\theta)_1 \\
(X\theta)_2 \\
\vdots \\
(X\theta)_n
\end{bmatrix} =
\sum_{i=1}^n y_i \cdot (X\theta)_i \)
\( (X\theta)^T y =
\begin{bmatrix}
(X\theta)_1 & (X\theta)_2 & \dots & (X\theta)_n
\end{bmatrix}
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix} =
\sum_{i=1}^n (X\theta)_i \cdot y_i \)
Both \(y^T(X\theta) \) and \( (X\theta)^Ty \) result in the same scalar value because the dot product is symmetric.
How \( y^T(X\theta) \) and \( (X\theta)^T y \) result in the same scalar
Assume \( X \) is a \( 2 \times 3 \) matrix, \( \theta \) is a \( 3 \times 1 \) vector, and \( y \) is a \( 2 \times 1 \) vector:
\( X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad \theta = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad y = \begin{bmatrix} 7 \\ 8 \end{bmatrix} \)
Compute \( \theta^T X^T y \)
Compute \( X^T y \):
\( X^T y = \begin{bmatrix} 39 \\ 54 \\ 69 \end{bmatrix} \)
Compute \( \theta^T X^T y \):
\( \theta^T X^T y = \newline (1)(39) + (1)(54) + (1)(69) = 162 \)
Compute \( y^T X \theta \)
Compute \( X \theta \):
\( X \theta = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \)
Compute \( y^T X \theta \):
\( y^T X \theta = (7)(6) + (8)(15) = 162 \)
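Reproducing the numbers above with NumPy (a small sketch; the array values are the ones from this example) confirms that both orderings give the same scalar:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
theta = np.array([1, 1, 1])
y = np.array([7, 8])

print(theta @ X.T @ y)   # theta^T X^T y = 162
print(y @ (X @ theta))   # y^T X theta   = 162, the same scalar
```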
Since both middle terms evaluate to the same scalar, the expanded expression
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right] \)
can be simplified to
\( J(\theta) = \frac{1}{2m} \left[ \theta^T X^T X \theta - 2 ( X \theta) ^T y + y^T y \right] \)
Let’s represent \( J(\theta) \) as:
\( J(\theta) = \frac{1}{2m} \left[ P(\theta) - Q(\theta) + y^T y \right] \)
Where
\( P(\theta) = \theta^T X^T X \theta \quad \text{and} \quad Q(\theta) = 2 (X\theta)^T y \)
We can ignore the last term \( y^T y \) when differentiating, as it does not involve \( \theta \).
Let’s calculate the derivative of \( J(\theta) \) with respect to \( \theta \) by calculating \( \frac{\partial P(\theta)}{\partial \theta} \) and \( \frac{\partial Q(\theta)}{\partial \theta} \), and combining them as
\( \frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} - \frac{\partial Q(\theta)}{\partial \theta} \right] \)
Start with
\( Q(\theta) = 2 ( X \theta) ^T y \)
Using the worked example below (the derivative of \( X\theta \) with respect to \( \theta \)), we can calculate
\( \frac{\partial Q(\theta)}{\partial \theta} = 2 X^T y \)
Imagine a 2×2 matrix for \( X \) and a 2×1 vector for \( \theta \):
\( X = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}, \quad \theta = \begin{bmatrix}
\theta_1 \\
\theta_2
\end{bmatrix} \)
\( X\theta = \begin{bmatrix}
1\cdot\theta_1 + 2\cdot\theta_2 \\
3\cdot\theta_1 + 4\cdot\theta_2
\end{bmatrix} = \begin{bmatrix}
\theta_1 + 2\theta_2 \\
3\theta_1 + 4\theta_2
\end{bmatrix} \)
The derivative of \(X\theta \) with respect to \( \theta \) measures how \( X\theta \) changes as each component of \( \theta \) changes.
\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
\frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_1} & \frac{\partial (\theta_1 + 2\theta_2)}{\partial \theta_2} \\
\frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_1} & \frac{\partial (3\theta_1 + 4\theta_2)}{\partial \theta_2}
\end{bmatrix} \)
\( \frac{\partial (X\theta)}{\partial \theta} =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix} = X \)
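A quick finite-difference check of this Jacobian, using the same \( 2 \times 2 \) matrix (the test point for \( \theta \) is an arbitrary choice of mine; the Jacobian of a linear map is the same everywhere):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
theta = np.array([0.7, -1.3])   # arbitrary point
eps = 1e-6

# Numerically estimate column j of d(X @ theta)/d(theta) by perturbing theta_j
jac = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    jac[:, j] = (X @ (theta + step) - X @ (theta - step)) / (2 * eps)

print(np.allclose(jac, X))  # True: the Jacobian is X itself
```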
To calculate \( \frac{\partial P(\theta)}{\partial \theta} \), where \( P(\theta) = \theta^T X^T X \theta \),
we need to use the derivative of a quadratic form:
\( \frac{\partial}{\partial \theta} \big( \theta^T A \theta \big) = (A + A^T)\theta \), which equals \( 2A\theta \) when \( A \) is symmetric.
(Expand the proof of the quadratic form below, if it doesn’t make sense.)
Since \( X^T X \) is symmetric, using the quadratic form above we can deduce:
\( \frac{\partial P(\theta)}{\partial \theta} = 2(X^\top X)\theta \)
Derivative of the quadratic form
We expand \( f(\theta) = \theta^T A \theta \) using matrix-vector multiplication, with a concrete \( 2 \times 2 \) example:
\( A = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \)
Performing the multiplication step-by-step:
\( A\theta = \begin{bmatrix} 2\theta_1 + 3\theta_2 \\ 4\theta_1 + 5\theta_2 \end{bmatrix} \)
Now multiply \( \theta^T \) with \( A \theta \):
\( \theta^T (A\theta) = \theta_1 (2\theta_1 + 3\theta_2) + \theta_2 (4\theta_1 + 5\theta_2) \)
Expanding:
\( f(\theta) = 2\theta_1^2 + 3\theta_1\theta_2 + 4\theta_2\theta_1 + 5\theta_2^2 \)
The derivative \( \frac{\partial f(\theta)}{\partial \theta} \) is the vector of partial derivatives with respect to \( \theta_1 \) and \( \theta_2 \). From the expanded form:
Partial derivative with respect to \( \theta_1 \), term by term:
- \( \frac{\partial}{\partial \theta_1}(2\theta_1^2) = 4\theta_1 \)
- \( \frac{\partial}{\partial \theta_1}(3\theta_1\theta_2) = 3\theta_2 \)
- \( \frac{\partial}{\partial \theta_1}(4\theta_2\theta_1) = 4\theta_2 \)
- \( \frac{\partial}{\partial \theta_1}(5\theta_2^2) = 0 \)
Combine:
\( \frac{\partial f}{\partial \theta_1} = 4\theta_1 + 3\theta_2 + 4\theta_2 = 4\theta_1 + 7\theta_2 \)
Partial derivative with respect to \( \theta_2 \), term by term:
- \( \frac{\partial}{\partial \theta_2}(2\theta_1^2) = 0 \)
- \( \frac{\partial}{\partial \theta_2}(3\theta_1\theta_2) = 3\theta_1 \)
- \( \frac{\partial}{\partial \theta_2}(4\theta_2\theta_1) = 4\theta_1 \)
- \( \frac{\partial}{\partial \theta_2}(5\theta_2^2) = 10\theta_2 \)
Combine:
\( \frac{\partial f}{\partial \theta_2} = 3\theta_1 + 4\theta_1 + 10\theta_2 = 7\theta_1 + 10\theta_2 \)
The gradient is:
\( \frac{\partial f(\theta)}{\partial \theta} = \begin{bmatrix} 4\theta_1 + 7\theta_2 \\ 7\theta_1 + 10\theta_2 \end{bmatrix} \)
Rewriting using \( A \):
\( \frac{\partial f(\theta)}{\partial \theta} = \begin{bmatrix} 4 & 7 \\ 7 & 10 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = (A + A^T)\theta \)
Thus:
\( \frac{\partial}{\partial \theta} \big( \theta^T A \theta \big) = (A + A^T)\theta \), which reduces to \( 2A\theta \) when \( A \) is symmetric (as \( X^T X \) is).
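A small numeric check of this gradient with the non-symmetric \( A \) used above (the test point is an arbitrary value of mine):

```python
import numpy as np

A = np.array([[2.0, 3.0], [4.0, 5.0]])
theta = np.array([1.5, -0.5])   # arbitrary test point
eps = 1e-6

f = lambda t: t @ A @ t          # the quadratic form f(theta) = theta^T A theta
grad_fd = np.array([
    (f(theta + np.array([eps, 0.0])) - f(theta - np.array([eps, 0.0]))) / (2 * eps),
    (f(theta + np.array([0.0, eps])) - f(theta - np.array([0.0, eps]))) / (2 * eps),
])

print(np.allclose(grad_fd, (A + A.T) @ theta))  # True for any A
print(np.allclose(grad_fd, 2 * A @ theta))      # False here, since this A is not symmetric
```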
Since
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ \frac{\partial P(\theta)}{\partial \theta} - \frac{\partial Q(\theta)}{\partial \theta} \right]
\)
Substituting these values, we get
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2m} \left[ 2(X^\top X)\theta - 2X^\top y \right]
\)
\(
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T (X\theta – y)
\)
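To double-check this final gradient expression, here is a finite-difference comparison on made-up data (the design matrix, targets, and test point are illustrative values):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # illustrative design matrix
y = np.array([2.0, 2.8, 3.6])                         # illustrative targets
theta = np.array([0.4, 0.9])                          # arbitrary test point
m = len(y)

J = lambda t: ((X @ t - y) @ (X @ t - y)) / (2 * m)   # the cost defined above
analytic = X.T @ (X @ theta - y) / m                  # gradient derived above

eps = 1e-6
numeric = np.array([
    (J(theta + np.array([eps, 0.0])) - J(theta - np.array([eps, 0.0]))) / (2 * eps),
    (J(theta + np.array([0.0, eps])) - J(theta - np.array([0.0, eps]))) / (2 * eps),
])
print(np.allclose(analytic, numeric))  # True
```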
Setting the derivative equal to zero:
\( \frac{1}{m} X^T (X\theta - y) = 0 \quad \Rightarrow \quad X^T X \theta = X^T y \)
Solving for \( \theta \):
\( \theta = (X^T X)^{-1} X^T y \)
This is the closed-form solution (the normal equation) for \( \theta \). It can be used to find the line that best fits linearly related data points.
Let’s use the closed form to find a line that fits the x and y values below:

| x | y |
|---|---|
| 1 | 2 |
| 2 | 2.8 |
| 3 | 3.6 |
The design matrix \(X\) and output vector \(y\) are:
\( X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 2.8 \\ 3.6 \end{bmatrix} \)
\( X^T X = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \)
\( X^T y = \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} \)
\( (X^T X)^{-1} = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \)
\( \theta = (X^T X)^{-1} X^T y = \begin{bmatrix} 7/3 & -1 \\ -1 & 0.5 \end{bmatrix} \begin{bmatrix} 8.4 \\ 18.4 \end{bmatrix} = \begin{bmatrix} 1.2 \\ 0.8 \end{bmatrix} \)
The best-fit line equation is:
\( y = 1.2 + 0.8x \)
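The same result can be reproduced in a few lines of NumPy (a sketch; in practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over explicitly inverting \( X^T X \)):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # design matrix with bias column
y = np.array([2.0, 2.8, 3.6])

theta = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equation: (X^T X)^{-1} X^T y
print(theta)                               # approx [1.2 0.8]  ->  y = 1.2 + 0.8x
```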