Simple Linear Regression

Keywords: linear regression

Notations of Linear Regression[1]

We have already created a simple linear model in the post “Introduction to Linear Regression”. $y=w_1x_1+w_2x_2$ is linear in both $\boldsymbol{x}=[x_1 \; x_2]^T$ and $\boldsymbol{w}=[w_1 \; w_2]^T$. Following the definition of linearity, we begin with the simplest linear regression model:

$$Y\sim w_1X+w_0\tag{1}$$

where the symbol $\sim$ is read as “is approximately modeled as”. Equation (1) can also be described as “regressing $Y$ on $X$” (or “$Y$ onto $X$”).

Recall the advertising example given in “Introduction to Linear Regression”. With equation (1), we can model the relationship between the TV advertising budget and sales:

$$\text{Sales}=w_1\times \text{TV}+w_0\tag{2}$$

From our knowledge of a line in the 2-D Cartesian coordinate system, $w_1$ is the “slope” and $w_0$ is the “intercept”. Before the linear model can be put to use, these coefficients (parameters) have to be estimated.

In equation (1), $X$ and $Y$ are observations from a population; in other words, they are the input and output of the model. Imagine a machine that converts grain into flour: the input, grain, plays the role of $X$ in equation (1), and the output, flour, is $Y$. Accordingly, $\boldsymbol{w}$ corresponds to the gears inside the machine.

The hat symbol “$\;\hat{}\;$” indicates that a variable is an estimate: it is not the true value of the variable, but a conjecture obtained through some mathematical strategy or other method that seems reliable.

The task of statistical learning or machine learning is to build a model, or create a method, that predicts or investigates relationships in the population based on the observed data. The gears of the machine are therefore estimated values, and the output for a newly arriving input is a predicted value as well. So the model we finally get is:

$$y=\hat{w_1}x+\hat{w_0}\tag{3}$$

Then a new input $x_0$ has the prediction:

$$\hat{y}_0=\hat{w_1}x_0+\hat{w_0}\tag{4}$$
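To make the notation concrete, here is a minimal sketch of equations (3) and (4) in Python; the coefficient values and the helper name `predict` are assumptions made up purely for illustration:

# A minimal sketch of equations (3) and (4).
# The coefficient values below are hypothetical, not estimated from any data.
w1_hat = 0.05  # assumed slope
w0_hat = 7.0   # assumed intercept

def predict(x0):
    # Equation (4): y_hat_0 = w1_hat * x0 + w0_hat
    return w1_hat * x0 + w0_hat

print(predict(100.0))  # prediction for a new input x0 = 100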

Estimating the Coefficients (Parameters)

The notation and some basic concepts were discussed above; now our mission is to estimate the parameters. For the advertising task, what we have are a linear regression model and several observation pairs (each input with its respective output):

$$\{(x_1,y_1),(x_2,y_2),(x_3,y_3),\dots,(x_n,y_n)\}\tag{5}$$

which is also known as the training set. Note that $x_i$ in equation (5) is a measurement of $X$, and likewise $y_i$ is a measurement of $Y$; $n$ is the size of the training set, i.e. the number of observation pairs.

The method we employ here is based on a measure of how close the model is to the observed data. By far the most commonly used criterion is “least squares”.

Given a candidate $\boldsymbol{w}$, we can compute the corresponding output $\hat{y}_i$ for every input $x_i$:

$$\{(x_1,\hat{y}_1),(x_2,\hat{y}_2),(x_3,\hat{y}_3),\dots,(x_n,\hat{y}_n)\}\tag{6}$$

and the difference between $y_i$ and $\hat{y}_i$ is called the residual, written as $e_i$:

$$e_i=y_i-\hat{y}_i\tag{7}$$

$y_i$ is the observation, the value our model is trying to reproduce. So the smaller $|e_i|$ is, the better the model. Because the absolute value is not convenient to work with analytically, we replace it with the square:

$$\text{RSS}=e_1^2+e_2^2+\dots+e_n^2\tag{8}$$

RSS stands for “Residual Sum of Squares”, the sum of the squared residuals. A linear regression model with a smaller RSS is better than one with a larger RSS.
Substituting equations (4) and (7) into (8) gives:

$$\begin{aligned} \text{RSS}=&(y_1-\hat{w_1}x_1-\hat{w_0})^2+(y_2-\hat{w_1}x_2-\hat{w_0})^2+\\ &\dots+(y_n-\hat{w_1}x_n-\hat{w_0})^2\\ =&\sum_{i=1}^n(y_i-\hat{w_1}x_i-\hat{w_0})^2 \end{aligned}\tag{9}$$
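As a quick illustration, the following sketch computes the RSS of equation (9) for an arbitrary candidate $\boldsymbol{w}$ on a small toy data set (both the data and the candidate values are assumptions, for illustration only):

import numpy as np

# Made-up toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# An arbitrary candidate for the parameters
w1_hat, w0_hat = 1.5, 0.5

# Equations (7)-(9): residuals and their sum of squares
residuals = y - (w1_hat * x + w0_hat)
rss = np.sum(residuals ** 2)
print(rss)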

To minimize the function $\text{RSS}$, calculus tells us that any minimum must occur at a stationary point, that is, a point where the derivative of the function is zero. Keep in mind that while minimum points must be stationary points, a stationary point is not necessarily a minimum.

Since $\text{RSS}$ is a function of both $w_0$ and $w_1$, the derivative is replaced by partial derivatives. Because the $\text{RSS}$ of this linear model is a simple quadratic surface, the minimum exists and there is only one stationary point. Our mission of finding the best parameters for the regression is therefore converted to solving the equations obtained by setting the partial derivatives to zero.

The partial derivative with respect to $\hat{w_1}$ is:

$$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}=&-2\sum_{i=1}^nx_i(y_i-\hat{w_1}x_i-\hat{w_0})\\ =&-2(\sum_{i=1}^nx_iy_i-\hat{w_1}\sum_{i=1}^nx_i^2-\hat{w_0}\sum_{i=1}^nx_i) \end{aligned}\tag{10}$$

and the partial derivative with respect to $\hat{w_0}$ is:

$$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_0}}}=&-2\sum_{i=1}^n(y_i-\hat{w_1}x_i-\hat{w_0})\\ =&-2(\sum_{i=1}^ny_i-\hat{w_1}\sum_{i=1}^nx_i-\sum_{i=1}^n\hat{w_0}) \end{aligned}\tag{11}$$
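A simple way to check equations (10) and (11) is to compare them with finite-difference approximations of the partial derivatives; the sketch below does so on made-up toy data (all values are assumptions, for illustration only):

import numpy as np

# Made-up toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w1_hat, w0_hat = 1.5, 0.5

def rss(w1, w0):
    # Equation (9)
    return np.sum((y - w1 * x - w0) ** 2)

# Analytic partial derivatives, equations (10) and (11)
d_w1 = -2 * np.sum(x * (y - w1_hat * x - w0_hat))
d_w0 = -2 * np.sum(y - w1_hat * x - w0_hat)

# Central finite-difference approximations
eps = 1e-6
d_w1_num = (rss(w1_hat + eps, w0_hat) - rss(w1_hat - eps, w0_hat)) / (2 * eps)
d_w0_num = (rss(w1_hat, w0_hat + eps) - rss(w1_hat, w0_hat - eps)) / (2 * eps)

print(d_w1, d_w1_num)  # should agree closely
print(d_w0, d_w0_num)  # should agree closely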

Setting both of them to zero, we get:

$$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_0}}}&=0\\ \hat{w_0} &=\frac{\sum_{i=1}^ny_i-\hat{w_1}\sum_{i=1}^nx_i}{n}\\ &=\bar{y}-\hat{w_1}\bar{x} \end{aligned}\tag{12}$$

and

$$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}&=0\\ \hat{w_1}&=\frac{\sum_{i=1}^nx_iy_i-\hat{w_0}\sum_{i=1}^nx_i}{\sum_{i=1}^nx_i^2} \end{aligned}\tag{13}$$

To get an equation for $\hat{w_1}$ on its own, we substitute equation (12) into equation (13):

$$\begin{aligned} \frac{\partial{\text{RSS}}}{\partial{\hat{w_1}}}&=0\\ \hat{w_1}&=\frac{\sum_{i=1}^nx_i(y_i-\bar{y})}{\sum_{i=1}^nx_i(x_i-\bar{x})} \end{aligned}\tag{14}$$

where $\bar{x}=\frac{\sum_{i=1}^nx_i}{n}$ and $\bar{y}=\frac{\sum_{i=1}^ny_i}{n}$.

By the way, equation (14) has another form

$$\hat{w_1}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})}\tag{15}$$

and they are equal, because $\sum_{i=1}^n\bar{x}(y_i-\bar{y})=0$ and $\sum_{i=1}^n\bar{x}(x_i-\bar{x})=0$.
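The closed-form estimates, and the equality of forms (14) and (15), can also be verified numerically; below is a small sketch with made-up data (the values are assumptions, for illustration only):

import numpy as np

# Made-up toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
x_bar, y_bar = x.mean(), y.mean()

# Equation (14)
w1_a = np.sum(x * (y - y_bar)) / np.sum(x * (x - x_bar))
# Equation (15)
w1_b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Equation (12)
w0 = y_bar - w1_b * x_bar

print(w1_a, w1_b)  # the two forms give the same slope
print(w0)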

Diagrams and Code

We can use Python to demonstrate that our results, equations (12) and (14), are correct:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load data from csv file by pandas
AdvertisingFilepath = './data/Advertising.csv'
data = pd.read_csv(AdvertisingFilepath)

# convert original data to numpy arrays
data_TV = np.array(data['TV'])
data_sale = np.array(data['sales'])

# calculate the means of x and y
y_sum = 0
y_mean = 0
x_sum = 0
x_mean = 0
for x, y in zip(data_TV, data_sale):
    y_sum += y
    x_sum += x
if len(data_sale) != 0:
    y_mean = y_sum / len(data_sale)
if len(data_TV) != 0:
    x_mean = x_sum / len(data_TV)

# calculate w_1 using equation (14)
w_1 = 0
a = 0
b = 0
for x, y in zip(data_TV, data_sale):
    a += x * (y - y_mean)
    b += x * (x - x_mean)
if b != 0:
    w_1 = a / b

# calculate w_0 using equation (12)
w_0 = y_mean - w_1 * x_mean

# draw the scatter plot and the fitted line
plt.xlabel('TV')
plt.ylabel('Sales')
plt.title('TV and Sales')
plt.scatter(data_TV, data_sale, s=8, c='g', alpha=0.5)
x = np.arange(-10, 350, 0.1)
plt.plot(x, w_1 * x + w_0, 'r-')
plt.show()

After running the code, we get a scatter plot of the TV budgets against sales, with the fitted regression line drawn through the points.
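As an additional cross-check (a sketch that assumes data_TV, data_sale, w_1 and w_0 from the script above are still in scope), the closed-form coefficients can be compared with the least-squares line returned by np.polyfit:

# Cross-check: NumPy's degree-1 least-squares fit should give the same line.
coeffs = np.polyfit(data_TV, data_sale, 1)  # returns [slope, intercept]
print(coeffs[0], w_1)  # slopes should agree
print(coeffs[1], w_0)  # intercepts should agree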

Reference


  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.