Linear regression: from basic form to matrix form

Linear regression is one of the simplest and most powerful tools in data science. Let’s start with the classic equation you’ve likely seen in school and build up to a more generalized, multi-dimensional matrix representation. This transition will help us handle multiple features and datasets in an elegant way, which is key to applying machine learning in the real world.

At its core, linear regression models a straight-line relationship between an input x and an output y.

\( y = mx + b \)

Here:

  • m is the slope, which represents the rate of change of y with respect to x
  • b is the intercept, which represents the value of y when x = 0
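As a quick preview of where we are headed, here is one possible NumPy sketch that fits such a line to synthetic data and then predicts with it (the data, the seed, and the variable names are purely illustrative):

```python
import numpy as np

# Synthetic 1-D data: y is roughly 2*x + 1 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# A degree-1 polynomial fit returns the least-squares slope m and intercept b
m, b = np.polyfit(x, y, 1)

# Predict y for a new input using the fitted line
x_new = 4.0
y_pred = m * x_new + b
print(m, b, y_pred)   # m ≈ 2, b ≈ 1
```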

Imagine you are trying to model a relationship between a set of inputs (features) and an output (target). This is a typical supervised learning problem, where you have a dataset of examples, each of which includes:

  • Input features x (independent variables).
  • The corresponding output y (dependent variable, or target).

Simple linear regression is exactly this setup with a single input feature: the straight line \( y = mx + b \), where the slope m tells you how much y changes for a unit change in x, and the intercept b is the value of y when x = 0.

Extending to multiple features

What if you have multiple features, not just one? For example, let’s say you want to predict y using two features, x_1 and x_2. In this case, the equation becomes:

\( y = w_1x_1 + w_2x_2 + b \)

Where:

  • w_1 is the weight (coefficient) for x_1,
  • w_2 is the weight for x_2,
  • b is the intercept.

This is a multiple linear regression model, where you have more than one input feature. Now, if you had more than two features (say, x_1, x_2, x_3, ..., x_n), the equation would continue to expand like this:

\( y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b \)
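In code, this weighted sum is just a dot product plus the intercept. A minimal sketch, with illustrative weights and inputs:

```python
import numpy as np

# Illustrative weights and a single data point with n = 3 features
w = np.array([0.5, -1.2, 3.0])    # w_1, w_2, w_3
b = 0.7                           # intercept
x = np.array([2.0, 1.0, 4.0])     # x_1, x_2, x_3

# y = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b, written as a dot product
y = np.dot(w, x) + b
print(y)   # ≈ 12.5
```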

Matrix Formulation: Generalizing to \( y = Xw \)

The next step is to rewrite this system in a more compact and convenient way, especially when working with datasets that involve multiple data points.

Suppose you have a dataset with m data points, where each data point consists of several input features. For simplicity, let’s assume you have two features for each data point, so your dataset might look like this:

Data Point | Feature 1 (x₁) | Feature 2 (x₂) | Target (y)
---------- | -------------- | -------------- | ----------
1          | x₁₁            | x₁₂            | y₁
2          | x₂₁            | x₂₂            | y₂
...        | ...            | ...            | ...
m          | xₘ₁            | xₘ₂            | yₘ

Matrix Representation

1. Feature Matrix X: The matrix X represents the input features of the dataset. Each row of X corresponds to one data point, and each column corresponds to a feature, plus one additional column for the intercept term (a short code sketch follows after this list).

If we include a column of 1s for the intercept term, the matrix X looks like this for m data points and the two features from the table above:

\( X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{m1} & x_{m2} \end{pmatrix} \)

2. Weight Vector w: The weight vector w contains the coefficients (weights) for the features, including the intercept term. It is a column vector:

\( w = \begin{pmatrix} b \\ w_1 \\ w_2 \end{pmatrix} \)

3. Target Vector y: The target vector y contains the values of the output for each data point:

\( y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} \)
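Assembling these three pieces in NumPy might look like the following sketch (the feature and target values are illustrative, with n = 2 as in the table above):

```python
import numpy as np

# Raw features: m = 3 data points, n = 2 features (illustrative values)
features = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
])

# Feature matrix X: prepend a column of 1s so the first entry of w acts as the intercept
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Weight vector w = (b, w_1, w_2) and target vector y (illustrative values)
w = np.array([0.7, 0.5, -1.2])
y = np.array([3.0, 2.5, 5.0])

print(X.shape, w.shape, y.shape)   # (3, 3) (3,) (3,)
```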

Matrix Equation:

Now, we can write the prediction equation for all m data points in a single matrix equation:

\( y = Xw \)
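In NumPy this is a single matrix-vector product, which produces the prediction for every data point at once. A small self-contained sketch (all values illustrative):

```python
import numpy as np

# Design matrix with a leading column of 1s (m = 3 data points, 2 features)
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 0.5],
    [1.0, 3.0, 1.5],
])
w = np.array([0.7, 0.5, -1.2])   # (b, w_1, w_2)

# One matrix-vector product gives the prediction for every data point at once
y_pred = X @ w
print(y_pred)   # ≈ [-1.2, 1.1, 0.4]
```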

Why Use Matrix Formulation?

Matrix notation makes it easier to:

  • Work with multiple data points at once.
  • Perform efficient computations when dealing with large datasets.
  • Apply linear algebra techniques for solving for the weights w using methods like the normal equation or gradient descent.

In the case of linear regression, we can find the optimal values for w (including the intercept b and the feature weights) using optimization techniques, such as minimizing the mean squared error (MSE) between the predicted and actual target values.
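For ordinary least squares, minimizing the MSE has a closed-form solution, the normal equation \( w = (X^T X)^{-1} X^T y \). The sketch below uses NumPy's least-squares solver rather than inverting \( X^T X \) explicitly, on synthetic data generated purely for illustration:

```python
import numpy as np

# Synthetic data: 100 points, 2 features, true weights (b, w_1, w_2) = (1.0, 2.0, -3.0)
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Solve the least-squares problem min ||Xw - y||^2; numerically more stable
# than explicitly inverting X.T @ X as in the normal equation
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # ≈ [1.0, 2.0, -3.0]
```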

The form y = Xw is a compact and generalized way to represent linear models, where:

  • X is the matrix of input features (with a bias column for the intercept),
  • w is the vector of learned weights (coefficients),
  • y is the vector of predicted values (or target values).

This formulation is extremely useful in machine learning because it allows us to represent and compute predictions efficiently for datasets with many features and data points. It also sets the stage for more advanced techniques like gradient descent and matrix inversion to find the optimal weights.

Transitioning to \( y = X\theta \)

Now, let's relate this to the equation \( y = X\theta \). The transition here is primarily a change in notation rather than a fundamental difference in the underlying mathematics.

In the context of linear regression:

  • \( w \) and \( \theta \) are both used to represent the weights or coefficients of the linear model. The notation \( \theta \) is typically used in machine learning for the parameters of the model, including both the intercept and the feature coefficients.
  • The matrix \( X \) still represents the input features (with a column of 1s for the intercept).
  • \( y \) still represents the target values.

Thus, the equation can be rewritten as:

\[ y = X\theta \]

Where:

  • \( \theta \) is the vector of parameters (equivalent to \( w \) in the earlier equation), containing the intercept term \( \theta_0 \) and the feature coefficients \( \theta_1, \theta_2, \dots, \theta_n \),
  • \( X \) is the matrix of input features, including the column of 1s that accounts for the intercept term.

Why Use \( \theta \) Instead of \( w \)?

The choice of \( \theta \) instead of \( w \) is largely a matter of convention:

  • In machine learning literature, the vector of parameters is often referred to as \( \theta \), especially in the context of algorithms such as gradient descent (see the sketch after this list) or when dealing with regularized regression (e.g., Ridge, Lasso).
  • The form \( y = X\theta \) makes it easier to generalize the model to more complex cases, such as logistic regression, where the target variable is categorical, or in the case of neural networks where the notation for weights extends naturally to multi-layer models.
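As an example of the gradient descent mentioned above, batch gradient descent on the MSE repeatedly updates \( \theta \) using the gradient \( \frac{2}{m} X^T (X\theta - y) \). A minimal sketch on synthetic data, with an illustrative learning rate and iteration count:

```python
import numpy as np

# Synthetic data with true parameters theta = (2.0, 0.5, -1.0)
rng = np.random.default_rng(0)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=m)

theta = np.zeros(3)     # start from all-zero parameters
learning_rate = 0.1     # illustrative step size

for _ in range(1000):
    gradient = (2 / m) * X.T @ (X @ theta - y)   # gradient of the MSE
    theta -= learning_rate * gradient

print(theta)   # ≈ [2.0, 0.5, -1.0]
```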

In data science, another form of this equation is frequently used:

\( h(\theta) = X\theta \) or, equivalently, \( h_\theta(X) = X\theta \)

where \( h(\theta) \) is called the hypothesis function. Using \( h(\theta) \) also helps us generalize the idea of a hypothesis, i.e. a prediction function, to any model, not just linear regression. In more complex models (e.g., logistic regression or neural networks), the hypothesis function is extended to incorporate non-linearities.

In short, \( h(\theta) = X\theta \) represents the prediction function (hypothesis) of the model, and its output is compared against the true target values in order to optimize the parameters during the learning process.
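As a sketch, the hypothesis function for linear regression can be written as a tiny helper (the function name `hypothesis` is just for illustration):

```python
import numpy as np

def hypothesis(theta: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Linear-regression hypothesis h_theta(X) = X @ theta."""
    return X @ theta

# Example: 3 data points with a bias column, illustrative parameters
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
theta = np.array([0.5, 1.5])
print(hypothesis(theta, X))   # ≈ [3.5, 5.0, 6.5]
```

During training, the output of this function is compared with the true targets by the chosen loss (e.g., MSE), exactly as described above.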
