# Simple Linear Regression

**Keywords:** linear regression

## Notations of Linear Regression^{[1]}

We have already created a simple linear model in the post “Introduction to Linear Regression”. $y=w_1x_1+w_2x_2$ is a linear equation in both $\boldsymbol{x}=[x_1 \; x_2]^T$ and $\boldsymbol{w}=[w_1 \; w_2]^T$. Following the definition of linearity, we arrive at the simplest linear regression:

$Y\sim w_1X+w_0\tag{1}$

where the symbol $\sim$ is read as “is approximately modeled as”. Equation (1) can also be described as “regressing $Y$ on $X$” (or “$Y$ onto $X$”).

Consider the advertising example given in “Introduction to Linear Regression”. With equation (1), we have a model relating the TV advertising budget to sales:

$\text{Sales}=w_1\times \text{TV}+ w_0\tag{2}$

From our knowledge of a line in the 2-D Cartesian coordinate system, $w_1$ is the “slope” and $w_0$ is the “intercept”. To make the linear model usable in practice, the coefficients (parameters) must be estimated.

In equation (1), $X$ and $Y$ are observations of a population; in other words, they are the inputs and outputs of the model. Imagine a machine that converts grain into flour: the input, grain, is $X$ in equation (1), and the output, flour, is $Y$. Accordingly, $\boldsymbol{w}$ plays the role of the gears in the machine.

The hat symbol “$\;\hat{}\;$” indicates that a variable is a prediction (or estimate): it is not the true value of the variable but a conjecture obtained through some mathematical strategy or other method that seems reliable.

The task of statistical learning or machine learning is to build a model or method that predicts, or investigates the relationships in, the data based on the **observed data**. The gears in the machine are therefore estimated values, and the output for a new input is also a predicted value. So, the model we finally get is:

$y=\hat{w_1}x+\hat{w_0}\tag{3}$

Then a new input $x_0$ has the prediction:

$\hat{y}_0=\hat{w_1}x_0+\hat{w_0}\tag{4}$
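Equation (4) can be sketched in a few lines of Python. The function name `predict` and the numeric values below are hypothetical, chosen only for illustration; they are not estimates from the advertising data.

```python
def predict(x, w1_hat, w0_hat):
    """Return the prediction y_hat = w1_hat * x + w0_hat, as in equation (4)."""
    return w1_hat * x + w0_hat


# Hypothetical estimated parameters and a new input (illustrative values only).
w1_hat, w0_hat = 0.5, 2.0
x0 = 10.0

y0_hat = predict(x0, w1_hat, w0_hat)
print(y0_hat)  # 0.5 * 10.0 + 2.0
```

Once the two parameters have been estimated, making a prediction costs a single multiplication and addition.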

## Estimating the Coefficients (Parameters)

The notation and some basic concepts were covered above; our mission now is to estimate the parameters. For the advertising task, what we have is a linear regression model and several observation pairs (each input with its respective output):

$\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\dots,(x_n,y_n)\}\tag{5}$

which is also known as the **training set**. Here $x_i$ in equation (5) is a measurement of $X$, and $y_i$ is a measurement of $Y$. $n$ is the size of the training set, that is, the number of observation pairs.

The method we employ here is based on a measure of how close the model is to the observed data. By far the most commonly used measure is the “least squares criterion”.

When we have a candidate $\boldsymbol{w}$, we can get the corresponding output $\hat{y}_i$ for every input $x_i$:

$\{(x_1,\hat{y}_1),(x_2,\hat{y}_2),(x_3,\hat{y}_3),\dots,(x_n,\hat{y}_n)\}\tag{6}$

and the difference between $y_i$ and $\hat{y}_i$ is called the **residual** and written as $e_i$:

$e_i=y_i-\hat{y}_i\tag{7}$

$y_i$ is the observation, the value our model is trying to reproduce. So the smaller $|e_i|$ is, the better the model. Because the absolute value is not differentiable at zero and is therefore awkward to handle analytically, we replace it with the square:

$\text{RSS}=e_1^2+e_2^2+\dots+e_n^2\tag{8}$

RSS stands for “Residual Sum of Squares”, the sum of the squared residuals. A linear regression model with a smaller RSS is better than one with a larger RSS.
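As a minimal sketch of equations (7) and (8), the snippet below computes the RSS for two candidate parameter pairs. The data are made-up toy values that lie exactly on the line $y=2x+1$, and the function name `rss` is an assumption introduced here for illustration.

```python
import numpy as np


def rss(x, y, w1, w0):
    """Residual sum of squares, equation (8), for a candidate (w1, w0)."""
    residuals = y - (w1 * x + w0)  # e_i = y_i - y_hat_i, equation (7)
    return float(np.sum(residuals ** 2))


# Toy data, made up for illustration: points exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

rss_good = rss(x, y, w1=2.0, w0=1.0)  # the line that generated the data
rss_bad = rss(x, y, w1=1.0, w0=1.0)   # a worse candidate with the wrong slope

print(rss_good, rss_bad)
```

The candidate that matches the generating line has zero RSS, while the candidate with the wrong slope has residuals 1, 2, 3, 4 and therefore an RSS of 30.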

Substituting equations (4) and (7) into (8):

$\begin{aligned} \text{RSS}=&(y_1-\hat{w_1}x_1-\hat{w_0})^2+(y_2-\hat{w_1}x_2-\hat{w_0})^2+\\ &\dots+(y_n-\hat{w_1}x_n-\hat{w_0})^2\\ =&\sum_{i=1}^n(y_i-\hat{w_1}x_i-\hat{w_0})^2 \end{aligned}\tag{9}$

To minimize the function $\text{RSS}$, calculus tells us that a minimum of a differentiable function can only occur at a stationary point, that is, a point where the derivative of the function is zero. Remember that a minimum point must be a stationary point, but a stationary point is not necessarily a minimum point; for example, $f(x)=x^3$ has a stationary point at $x=0$ that is neither a minimum nor a maximum.

Since $\text{RSS}$ is a function of both $\hat{w_0}$ and $\hat{w_1}$, the derivative is replaced by partial derivatives. Because $\text{RSS}$ is a sum of squares, it is a convex quadratic surface in $(\hat{w_0},\hat{w_1})$: it has a minimum and no maximum, and as long as the $x_i$ are not all equal there is exactly one stationary point, which is that minimum. Our mission of finding the best parameters for the regression has thus been converted into solving the equations obtained by setting the partial derivatives to zero.

The partial derivative with respect to $\hat{w_1}$ is

$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}=&-2\sum_{i=1}^nx_i(y_i-\hat{w_1}x_i-\hat{w_0})\\ =&-2(\sum_{i=1}^nx_iy_i-\hat{w_1}\sum_{i=1}^nx_i^2-\hat{w_0}\sum_{i=1}^nx_i) \end{aligned}\tag{10}$

and the partial derivative with respect to $\hat{w_0}$ is:

$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_0}}}=&-2\sum_{i=1}^n(y_i-\hat{w_1}x_i-\hat{w_0})\\ =&-2(\sum_{i=1}^ny_i-\hat{w_1}\sum_{i=1}^nx_i-\sum_{i=1}^n\hat{w_0}) \end{aligned}\tag{11}$
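The analytic derivatives in equations (10) and (11) can be sanity-checked against central finite differences. This is only a sketch: the toy data, the candidate parameters, the step size `h`, and the function names are all assumptions made for illustration.

```python
import numpy as np


def rss(w1, w0, x, y):
    """Residual sum of squares, equation (9)."""
    return float(np.sum((y - w1 * x - w0) ** 2))


def rss_gradient(w1, w0, x, y):
    """Analytic partial derivatives of RSS from equations (10) and (11)."""
    r = y - w1 * x - w0
    d_w1 = -2.0 * np.sum(x * r)  # equation (10)
    d_w0 = -2.0 * np.sum(r)      # equation (11)
    return float(d_w1), float(d_w0)


# Toy data and an arbitrary candidate (w1, w0), made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
w1, w0 = 1.5, 0.3

# Central finite differences approximate each partial derivative.
h = 1e-6
num_d_w1 = (rss(w1 + h, w0, x, y) - rss(w1 - h, w0, x, y)) / (2 * h)
num_d_w0 = (rss(w1, w0 + h, x, y) - rss(w1, w0 - h, x, y)) / (2 * h)

ana_d_w1, ana_d_w0 = rss_gradient(w1, w0, x, y)
print(ana_d_w1, num_d_w1)
print(ana_d_w0, num_d_w0)
```

Because RSS is quadratic in the parameters, the central difference agrees with the analytic value up to floating-point rounding.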

Setting both partial derivatives to zero, we get:

$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_0}}}&=0\\ \hat{w_0} &=\frac{\sum_{i=1}^ny_i-\hat{w_1}\sum_{i=1}^nx_i}{n}\\ &=\bar{y}-\hat{w_1}\bar{x} \end{aligned}\tag{12}$

and

$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}&=0\\ \hat{w_1}&=\frac{\sum_{i=1}^nx_iy_i-\hat{w_0}\sum_{i=1}^nx_i}{\sum_{i=1}^nx_i^2} \end{aligned}\tag{13}$

To get an equation for $\hat{w_1}$ alone, we substitute equation (12) into equation (13):

$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}&=0\\ \hat{w_1}&=\frac{\sum_{i=1}^nx_i(y_i-\bar{y})}{\sum_{i=1}^nx_i(x_i-\bar{x})} \end{aligned}\tag{14}$

where $\bar{x}=\frac{\sum_{i=1}^nx_i}{n}$ and $\bar{y}=\frac{\sum_{i=1}^ny_i}{n}$
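As a supplementary check, not part of the derivation above, the second-derivative test confirms that this stationary point is a minimum. Differentiating equations (10) and (11) once more gives the matrix of second partial derivatives, written here as $H$ (a symbol introduced only for this check):

$H=\begin{bmatrix} \frac{\partial^2{\text{RSS}}}{\partial{\hat{w_1}}^2} & \frac{\partial^2{\text{RSS}}}{\partial{\hat{w_1}}\partial{\hat{w_0}}}\\ \frac{\partial^2{\text{RSS}}}{\partial{\hat{w_0}}\partial{\hat{w_1}}} & \frac{\partial^2{\text{RSS}}}{\partial{\hat{w_0}}^2} \end{bmatrix} =2\begin{bmatrix} \sum_{i=1}^nx_i^2 & \sum_{i=1}^nx_i\\ \sum_{i=1}^nx_i & n \end{bmatrix}$

$\det H=4\left(n\sum_{i=1}^nx_i^2-\left(\sum_{i=1}^nx_i\right)^2\right)=4n\sum_{i=1}^n(x_i-\bar{x})^2\geq 0$

The determinant is strictly positive unless every $x_i$ equals $\bar{x}$, and the diagonal entry $2n$ is positive, so $H$ is positive definite whenever the inputs are not all identical. In that case the stationary point given by equations (12) and (14) is the unique minimum of $\text{RSS}$.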

By the way, equation (14) has another form

$\hat{w_1}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})}\tag{15}$

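and the two forms are equal: since $\sum_{i=1}^n(y_i-\bar{y})=0$ and $\sum_{i=1}^n(x_i-\bar{x})=0$, the extra terms $\bar{x}\sum_{i=1}^n(y_i-\bar{y})$ in the numerator and $\bar{x}\sum_{i=1}^n(x_i-\bar{x})$ in the denominator both vanish.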
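The agreement between equations (14) and (15) can also be checked numerically. The sketch below uses synthetic data generated with a fixed random seed (it is not the advertising data), a hypothetical helper `fit_simple_linear`, and NumPy's `np.polyfit` as an independent least-squares reference.

```python
import numpy as np


def fit_simple_linear(x, y):
    """Least-squares estimates using equations (14) and (12)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1_hat = np.sum(x * (y - y_bar)) / np.sum(x * (x - x_bar))  # equation (14)
    w0_hat = y_bar - w1_hat * x_bar                             # equation (12)
    return w1_hat, w0_hat


# Synthetic data for illustration only: a noisy line with slope 0.05, intercept 7.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, size=50)
y = 0.05 * x + 7.0 + rng.normal(0.0, 1.0, size=50)

w1_hat, w0_hat = fit_simple_linear(x, y)

# Alternative form of the slope, equation (15).
x_bar, y_bar = x.mean(), y.mean()
w1_alt = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Independent reference: a degree-1 polynomial fit returns [slope, intercept].
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(w1_hat, w1_alt, slope_ref)
print(w0_hat, intercept_ref)
```

The three slope values and the two intercept values should agree up to floating-point error, which is a quick way to catch a sign or index mistake in the derivation.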

## Diagrams and Code

We use Python to demonstrate that our results, equations (12) and (14), are correct:

```python
import numpy as np
```

After running the code, we got:

## Reference

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.