Keywords: linear regression
Notations of Linear Regression
We have already created a simple linear model in the post “Introduction to Linear Regression”, where the response $Y$ is modeled as a linear function of the predictor $X$. Following the definition of linearity, we arrive at the simplest linear regression:

$$Y \approx \beta_0 + \beta_1 X \tag{1}$$
where the symbol $\approx$ is read as “is approximately modeled as”. Equation (1) can also be described as “regressing $Y$ on $X$” (or “$Y$ onto $X$”).
Recall the advertising example given in “Introduction to Linear Regression”. With equation (1), we have a model relating the TV advertising budget to sales:

$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV} \tag{2}$$
From our knowledge of a line in the 2-D Cartesian coordinate system, $\beta_1$ is the “slope” and $\beta_0$ is the “intercept”. Before the linear model can be put to use, these coefficients (parameters) must be estimated.
In equation (1), $X$ and $Y$ are observations of a population; in other words, they are the inputs and outputs of the model. Picture a machine that converts grain into flour: the grain fed in is $X$ in equation (1), and the flour produced is $Y$. Accordingly, the coefficients $\beta_0$ and $\beta_1$ are the gears inside the machine.
The hat symbol “$\hat{\phantom{x}}$” indicates that a variable is a prediction or estimate: it is not the true value of the variable, but a conjecture produced through some mathematical strategy or other method that seems reliable.
The duty of statistical learning (or machine learning) is to build a model, based on the observed data, that predicts outputs or reveals the relationships within the data. The gears of the machine are then estimated values $\hat{\beta}_0$ and $\hat{\beta}_1$, and the output for any new input is likewise a predicted value. So the model we finally get is:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \tag{3}$$
Then a newly arriving input $x$ has its prediction:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{4}$$
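As a toy illustration of equation (4), suppose we already had estimated coefficients (the values below are made up for illustration, not estimated from any real advertising data); predicting for a new input is then a single line:

```python
# Hypothetical estimated coefficients (illustrative values only)
beta_0_hat = 7.0   # intercept
beta_1_hat = 0.05  # slope

def predict(x):
    """Apply the fitted line: y_hat = beta_0_hat + beta_1_hat * x."""
    return beta_0_hat + beta_1_hat * x

print(predict(100.0))  # prediction for a new input x = 100
```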
The notation and some basic concepts have been covered above; our mission now is to estimate the parameters. For the advertising task, what we have are a linear regression model and several observation pairs (inputs and their respective outputs):

$$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \tag{5}$$
which is also known as the training set. By the way, $x_i$ in equation (5) is a measurement of $X$, and likewise $y_i$ is a measurement of $Y$. $n$ is the size of the training set, i.e., the number of observation pairs.
The method we employ here is based on a measure of how closely the model fits the observed data. By far the most commonly used criterion is “least squares”.
When we have candidate coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, we can compute the corresponding output $\hat{y}_i$ for every input $x_i$:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \tag{6}$$
and the difference between $y_i$ and $\hat{y}_i$ is called the residual, written as $e_i$:

$$e_i = y_i - \hat{y}_i \tag{7}$$
$y_i$ is the observation, the value our model is trying to reproduce, so the smaller $|e_i|$ is, the better the model. Because the absolute value is awkward to work with analytically (it is not differentiable at zero), we replace it with the square:

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 \tag{8}$$
RSS stands for “Residual Sum of Squares”, the total of the squared residuals. A linear regression model with a smaller RSS is better than one with a larger RSS.
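Computing the RSS of equation (8) is straightforward. A minimal sketch, with made-up training data and made-up candidate coefficients:

```python
import numpy as np

# Made-up training data and candidate coefficients (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta_0_hat, beta_1_hat = 0.0, 2.0

y_hat = beta_0_hat + beta_1_hat * x  # predictions, equation (6)
e = y - y_hat                        # residuals, equation (7)
rss = np.sum(e ** 2)                 # residual sum of squares, equation (8)
print(rss)
```

A different pair of candidate coefficients would give a different RSS; least squares picks the pair that makes this number as small as possible.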
Substituting equations (6) and (7) into (8):

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \tag{9}$$
To minimize the function $\mathrm{RSS}$, calculus tells us that the possible minimum points always lie at stationary points, i.e., the points where the derivative of the function is zero. Remember that a minimum point must be a stationary point, but a stationary point is not necessarily a minimum point.
Since $\mathrm{RSS}$ is a function of both $\hat{\beta}_0$ and $\hat{\beta}_1$, the derivative is replaced by partial derivatives. Because the $\mathrm{RSS}$ of this linear combination is just a simple quadric surface, bounded below, the minimum must exist and there is only one stationary point. Our mission of finding the best parameters for the regression is therefore converted to solving the equations obtained by setting the partial derivatives to zero.
The partial derivative of $\mathrm{RSS}$ with respect to $\hat{\beta}_0$ is:

$$\frac{\partial\, \mathrm{RSS}}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \tag{10}$$
and the partial derivative with respect to $\hat{\beta}_1$ is:

$$\frac{\partial\, \mathrm{RSS}}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \tag{11}$$
Setting both of them to zero, we get:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{12}$$

$$\sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 \tag{13}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means.
To get an equation for $\hat{\beta}_1$ on its own, we substitute equation (12) into equation (13) and solve:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \tag{14}$$
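As noted earlier, a stationary point is not automatically a minimum; for this quadric surface it is. We can convince ourselves numerically: at the least-squares solution, perturbing either coefficient can only increase the RSS. The data below are synthetic, generated from a known line plus noise:

```python
import numpy as np

# Synthetic data: a known line plus Gaussian noise (illustrative only)
rng = np.random.default_rng(42)
x = rng.random(50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=50)

def rss(b0, b1):
    """Residual sum of squares for candidate coefficients (b0, b1)."""
    return np.sum((y - b0 - b1 * x) ** 2)

# Least-squares solution (the stationary point of RSS)
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

best = rss(b0_hat, b1_hat)
# Every perturbation of the coefficients strictly increases the RSS
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert rss(b0_hat + db0, b1_hat + db1) > best
print(best)
```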
By the way, equation (14) has another form:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{15}$$

and the two are equal: expanding the numerator and denominator of (15) and using $\sum_i x_i = n\bar{x}$ and $\sum_i y_i = n\bar{y}$ recovers (14).
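A quick numerical check, on synthetic data, that the two forms of $\hat{\beta}_1$ in equations (14) and (15) agree:

```python
import numpy as np

# Synthetic data (illustrative only)
rng = np.random.default_rng(7)
x = rng.random(200)
y = 5.0 * x - 2.0 + rng.normal(scale=0.5, size=200)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Slope via equation (14)
b1_a = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
# Slope via equation (15)
b1_b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

print(abs(b1_a - b1_b))  # differs only by floating-point rounding
```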
We can also use Python to demonstrate that our results, equations (12) and (14), are correct. The data below are synthetic, generated from a known line plus noise:

```python
import numpy as np

# Synthetic data from a known line plus noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Equation (14): the slope
beta_1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
# Equation (12): the intercept
beta_0 = y_bar - beta_1 * x_bar

# Cross-check against NumPy's least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta_1, slope)
print(beta_0, intercept)
```

After running the code, the slope and intercept computed from equations (12) and (14) agree with the least-squares fit returned by `np.polyfit`, confirming the derivation.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. *An Introduction to Statistical Learning*. Vol. 112. New York: Springer, 2013. ↩︎