Are the Parameters Calculated from Least Squared Error Correct?

Keywords: least squared error, linear regression

How much we can trust the estimated parameters of a model is always a concern. To use the previously discussed methods with more confidence, we would like to build a reliable framework under which the method is always feasible.

Are you sure your decision is correct?

We have introduced a naive method to solve a linear regression problem: the relation between the TV advertising budget and sales.[1] The problem is built on probability, which deals with uncertainty, so the answers to this sort of problem are uncertain as well. We therefore have to collect evidence to support our result and to answer unfriendly questions like this:

  • “Are you sure your decision is correct?”
  • “95.89%”

How to get the percentage “95.89%” and how to build evidence to support this number are what we are going to talk about today.


Before we start this long mathematical article, a fundamental assumption should be accepted. Both the data we will deal with and all the problems we will come across share a common model:

$$y=f(x)+\varepsilon \tag{1}$$

where $y$ is produced by an unknown system $f(x)+\varepsilon$, but $y$ can be observed. Both $f(\cdot)$ and $x$ are unknown or only partly known. $\varepsilon$ is a random variable with mean zero; it can have any distribution, though the normal distribution is usually used because of its simplicity.

Why is this odd-looking equation our essential condition? The philosophy behind it is our belief in mathematics. We have faith that the world runs on mathematics, that is, on the function $f(x)$ that drives everything around us. It exists in an unknown form and we may never know what it is, but we can use a model to approach it as closely as we want. This is the basic idea of statistics. On the other hand, $\varepsilon$ is the error we make during the whole procedure. An easy example is measuring the length of a pen: your tool, the ruler, may not be precise, or you may write down the wrong number. These kinds of mistakes are always there and we can never get rid of them, so we use a random variable $\varepsilon$ to make the model more practical.

A small conclusion: $f(x)$ is the rule behind the world, and it is unknown. $\varepsilon$ is the error made during observation or recording, and it is regarded as a zero-mean random variable.

Linear $f(x)$

In ‘Simple Linear Regression’, we solved the problem with a very popular method (almost every machine learning course puts this topic at the beginning), but we would like to go deeper. The first question we come across is how to analyze the estimates of the parameters.

Substituting our linear regression equation into equation (1), we get the model:

$$y=w_1x+w_0+\varepsilon \tag{2}$$

The parameters have their names:

  • $w_0$ is the intercept: when we set $x=0$, the expected value of $y$ is $w_0$
  • $w_1$ is called the slope, and it represents the average increase in $y$ for a one-unit increase in $x$

$\varepsilon$, however, does not have a special name. It catches all the noise in the model, for example:

  1. the original $f(x)$ behind the data may not be linear
  2. measurements may have errors

$\varepsilon$ plays an important part in linear regression, which brings us to the second assumption:

$\varepsilon$ is independent of $x$

However, this assumption is clearly violated in the “TV-Sales” problem. We can see this in the linear regression figure:

If we regard the error as the distance between the sample points and the line, it is easy to see that the error grows as $X$ (TV) increases. We still adopt the assumption, because it makes the regression analysis easier.

If equation (2) is the exact model of our observations, the line $f(\cdot)$ is called the ‘population regression line’. As we said, this equation is not usually known, so we can only use a regression line to approximate it. Among such regression lines, the least-squares line is one of the best linear approximations to the true relationship.

Relationship between Population Regression Line and Least Squares Lines

Now we take

$$Y=2.3x+1+\varepsilon \tag{3}$$

as our population regression line, in which $\varepsilon$ has a zero-mean Gaussian distribution. Then 5 samples are generated, each containing 20 points ($x$ is a random variable, but we can treat it as fixed without loss of generality). Each sample is used to fit a line by the least-squares method. We draw all of them in a single figure, where the red line (Generator) is the population regression line and the dashed lines are the 5 least-squares lines:

This procedure can be repeated again and again. It is impossible for a least-squares line to coincide exactly with the population regression line (the estimate can be modeled as a continuous random variable, so the probability of this event is $0$). But the expectations of $\hat{w_0}$ and $\hat{w_1}$ will be equal to $1$ and $2.3$ if we can draw infinitely many samples. In other words:

$$\begin{aligned} \mathbb{E}(\hat{w_0})\stackrel{\text{number of samples}\to \infty}{=}w_0\\ \mathbb{E}(\hat{w_1})\stackrel{\text{number of samples}\to \infty}{=}w_1 \end{aligned}\tag{4}$$

This is true because the least-squares line gives unbiased estimates of the parameters. Bias and unbiasedness are an interesting topic in statistics.

In short, if the expectation of an estimated parameter equals the original coefficient $(w_0,w_1)$ in equation (2), the estimate is called unbiased. Being unbiased does not mean the estimate is better than a biased one, but it has some good properties that a biased one does not.
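The sampling experiment described in this section can be imitated with a few lines of Python. This is only a minimal sketch: the helper name `fit_line`, the fixed grid of $x$ values, the noise standard deviation of 1, the seed, and the 5000 repetitions standing in for "infinitely many samples" are all assumptions made for illustration, not values taken from the original experiment.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1_hat = sxy / sxx
    w0_hat = y_bar - w1_hat * x_bar
    return w0_hat, w1_hat

rng = random.Random(0)               # arbitrary fixed seed for repeatability
xs = [i / 2 for i in range(20)]      # 20 fixed x values (assumed grid)
n_samples = 5000                     # stands in for "infinitely many" samples

fits = []
for _ in range(n_samples):
    # Population line Y = 2.3x + 1 + eps; noise standard deviation 1 is assumed.
    ys = [2.3 * x + 1 + rng.gauss(0, 1) for x in xs]
    fits.append(fit_line(xs, ys))

mean_w0 = sum(w0 for w0, _ in fits) / n_samples
mean_w1 = sum(w1 for _, w1 in fits) / n_samples

for w0_hat, w1_hat in fits[:5]:      # five individual least-squares lines
    print(f"w0_hat = {w0_hat:.3f}, w1_hat = {w1_hat:.3f}")
print(f"average over {n_samples} samples: w0 = {mean_w0:.3f}, w1 = {mean_w1:.3f}")
```

Each individual line differs from the generator, while the averages of the estimates should land close to the true intercept and slope.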

Standard Error

The standard error is one of the most useful statistics in parameter estimation. Variance is an essential numerical feature of a random variable, a distribution, or a set of data, and the standard deviation is the square root of the variance. The standard deviation of an estimate computed from a sample is called the standard error instead. If we want to estimate $\mu$, the mean of $y$, we can use $\hat{\mu}=\frac{1}{n}\sum^ny_i$. We then analyze the variance of $\hat{\mu}$, that is:

$$\text{var}(\hat{\mu})=\text{SE}(\hat{\mu})^2=\frac{\delta^2}{n} \tag{5}$$

where $\delta$ is the standard deviation of each point of the sample. This is a little confusing, so we go back to equation (3), where $x_i$ is a random variable with the same distribution as $x$, and the randomness of $y_i$ comes from $\varepsilon$ but not from $x_i$. That means $y_i$ has a Gaussian distribution with mean $w_1X_i+w_0$, and $\delta$ is the square root of its variance. For now, we assume all the points in the sample are independent, so:

$$\begin{aligned} \text{var}(\hat{\mu})&=\text{var}(\frac{1}{n}\sum^n y_i)\\ &=\frac{1}{n^2}\text{var}(\sum^ny_i)\\ &=\frac{1}{n^2}\sum^n\text{var}(y_i)\\ &=\frac{n\delta^2}{n^2}=\frac{\delta^2}{n} \end{aligned}\tag{6}$$

Equation (5) is a very famous equation, and we can draw the following conclusions from it:

  1. as $n$ increases, $\text{var}(\hat{\mu})$ decreases, so $\hat{\mu}$ becomes more and more certain.
  2. $\text{var}(\hat{\mu})$, that is $\text{SE}(\hat{\mu})^2$, is a good measure of how far $\hat{\mu}$ is from the actual $\mu$
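The $\frac{\delta^2}{n}$ behaviour can be checked numerically. The sketch below draws many independent samples of size $n$, computes the mean of each, and compares the spread of those means with $\delta^2/n$. The true mean, the value of $\delta$, the sample sizes, and the number of trials are arbitrary assumptions chosen only for the illustration.

```python
import random
import statistics

rng = random.Random(1)   # arbitrary fixed seed
mu, delta = 5.0, 2.0     # assumed true mean and per-point standard deviation
trials = 4000            # number of repeated samples per sample size

empirical_var = {}
for n in (10, 40, 160):
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, delta) for _ in range(n)]
        means.append(statistics.fmean(sample))
    # Variance of the sample mean across the repeated samples.
    empirical_var[n] = statistics.pvariance(means)
    print(f"n = {n:3d}: empirical var = {empirical_var[n]:.4f}, "
          f"delta^2 / n = {delta ** 2 / n:.4f}")
```

Quadrupling $n$ should cut the variance of $\hat{\mu}$ to roughly a quarter, which is the first conclusion in the list above.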

In one sentence: the smaller the SE, the less the uncertainty.

A hint here: we should separate a sample point from a realization of it. A sample point is a random variable that has the same distribution as the population, and this is why we can use a sample to estimate the parameters of the population distribution. SE plays an indispensable part in the following sections.

Use Standard Error (SE) to Solve the Question

“How sure are we about the estimates of $w_1$ and $w_0$?”

Now we go back to the small example in the last section about the mean of $y$. This $y$ can be the $y$ in equation (3) as well as in equation (2). Least squares gave us a solution of equation (2) in the post ‘Simple Linear Regression’:

$$\begin{aligned} \hat{w_0}&=\bar{y}-\hat{w_1}\bar{x}\\ \hat{w_1}&=\frac{\sum_{i=1}^nx_i(y_i-\bar{y})}{\sum_{i=1}^nx_i(x_i-\bar{x})}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})} \end{aligned}\tag{7}$$

where $\bar{y}$ has the same meaning as $\hat{\mu}$ (Notation: $\bar{Y}$ is the mean of the population and $\bar{y}$ is the mean of the sample, but here we treat them as the same thing). Applying the variance calculation of equation (6) to equation (7), we get the $\text{SE}$ of $\hat{w_0}$ and $\hat{w_1}$:

$$\begin{aligned} \text{SE}(\hat{w_0})^2&=\delta^2[\frac{1}{n}+\frac{\bar{x}^2}{\sum^{n}_{i=1}(x_i-\bar{x})^2}]\\ \text{SE}(\hat{w_1})^2&=\frac{\delta^2}{\sum^{n}_{i=1}(x_i-\bar{x})^2} \end{aligned}\tag{8}$$

and $\delta^2=\text{var}(\varepsilon)$. This derivation is a little complicated; in the multi-variable case, however, it can be derived more easily. The following features can be read from equation (8):

  1. When $\bar{x}=0$, $\text{SE}(\hat{w_0})^2$ equals the variance of the sample mean of $y$: $\text{SE}(\hat{w_0})^2=\frac{\delta^2}{n}$
  2. As $\sum^{n}_{i=1}(x_i-\bar{x})^2$ grows, the squared $\text{SE}$ of $\hat{w_0}$ and $\hat{w_1}$ shrink, and both estimates become more certain.
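To see that the standard error formulas for $\hat{w_0}$ and $\hat{w_1}$ describe the actual spread of the estimates, one can compare them with a simulation. The following is a rough sketch under assumed settings: the helper names `fit_line` and `standard_errors`, the $x$ grid, $\delta=1$, and the 4000 repetitions are illustrative choices, and $\delta$ is treated as known here.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1_hat = sxy / sxx
    return y_bar - w1_hat * x_bar, w1_hat

def standard_errors(xs, delta):
    """Theoretical SE of the intercept and slope estimates for known delta."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    se_w0 = delta * math.sqrt(1 / n + x_bar ** 2 / sxx)
    se_w1 = delta / math.sqrt(sxx)
    return se_w0, se_w1

rng = random.Random(2)               # arbitrary fixed seed
delta = 1.0                          # assumed noise standard deviation
xs = [i / 2 for i in range(20)]      # assumed fixed x grid
se_w0, se_w1 = standard_errors(xs, delta)

w0_hats, w1_hats = [], []
for _ in range(4000):
    ys = [2.3 * x + 1 + rng.gauss(0, delta) for x in xs]
    w0_hat, w1_hat = fit_line(xs, ys)
    w0_hats.append(w0_hat)
    w1_hats.append(w1_hat)

# Standard deviation of the estimates across the repeated samples.
emp_se_w0 = statistics.pstdev(w0_hats)
emp_se_w1 = statistics.pstdev(w1_hats)
print(f"SE(w0): formula {se_w0:.4f}, simulated {emp_se_w0:.4f}")
print(f"SE(w1): formula {se_w1:.4f}, simulated {emp_se_w1:.4f}")
```

Spreading the $x$ values further apart (a larger $\sum(x_i-\bar{x})^2$) should make both numbers smaller.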

Both $\text{SE}$s contain $\delta^2$, and $\delta^2$ is a key numerical property of the population distribution, so we would like to estimate it from the sample. The estimate of $\delta$ is known as the residual standard error (RSE), and is given by the formula:

$$\text{RSE}=\sqrt{\frac{\text{RSS}}{n-2}},\qquad \text{RSS}=\sum_{i=1}^n(y_i-\hat{w_0}-\hat{w_1}x_i)^2$$

Because $\delta^2$ is always unknown and is estimated from the data, $\text{SE}$ should be written as $\hat{\text{SE}}$, but for simplicity of notation we will write $\text{SE}$.
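As an illustration, the residual standard error can be computed directly from the residuals of a fitted line. The sketch assumes the usual definition $\text{RSE}=\sqrt{\text{RSS}/(n-2)}$ with $\text{RSS}$ the sum of squared residuals; the function names, the 200-point sample, and $\delta=1.5$ are hypothetical choices.

```python
import math
import random

def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1_hat = sxy / sxx
    return y_bar - w1_hat * x_bar, w1_hat

def residual_standard_error(xs, ys):
    """Estimate delta as sqrt(RSS / (n - 2)) from the fitted residuals."""
    w0_hat, w1_hat = fit_line(xs, ys)
    rss = sum((y - w0_hat - w1_hat * x) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (len(xs) - 2))

rng = random.Random(3)                 # arbitrary fixed seed
delta = 1.5                            # assumed true noise standard deviation
xs = [i / 20 for i in range(200)]      # assumed x values
ys = [2.3 * x + 1 + rng.gauss(0, delta) for x in xs]

rse = residual_standard_error(xs, ys)
print(f"true delta = {delta}, RSE = {rse:.3f}")
```

The divisor $n-2$ reflects that two parameters, $\hat{w_0}$ and $\hat{w_1}$, were already estimated from the same data.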

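For the confidence intervals of $w_0$ and $w_1$, there is approximately a $95\%$ chance that the intervals:

$$\begin{aligned} &\hat{w_0}\pm2\,\text{SE}(\hat{w_0})\\ &\hat{w_1}\pm2\,\text{SE}(\hat{w_1}) \end{aligned}$$

will contain $w_0$ and $w_1$ respectively (where $95\%$ is not the exact probability for an interval of $\pm2\text{SE}$)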

The question “Are you sure your decision is correct?” can be answered now: the interval $\hat{w_1}\pm2\text{SE}$ has about a $95\%$ probability of catching the actual $w_1$.
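How often the interval $\hat{w_1}\pm2\text{SE}(\hat{w_1})$ really catches $w_1$ can be estimated by repeating the experiment. This is a sketch under assumptions: the helper `fit_with_se` is a hypothetical name, $\delta$ is estimated by $\sqrt{\text{RSS}/(n-2)}$, and the $x$ grid, noise level, and 2000 trials are arbitrary.

```python
import math
import random

def fit_with_se(xs, ys):
    """Return (w0_hat, w1_hat, se_w0, se_w1), with delta estimated from residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1_hat = sxy / sxx
    w0_hat = y_bar - w1_hat * x_bar
    rss = sum((y - w0_hat - w1_hat * x) ** 2 for x, y in zip(xs, ys))
    delta_hat = math.sqrt(rss / (n - 2))
    se_w0 = delta_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)
    se_w1 = delta_hat / math.sqrt(sxx)
    return w0_hat, w1_hat, se_w0, se_w1

rng = random.Random(4)               # arbitrary fixed seed
xs = [i / 2 for i in range(20)]      # assumed fixed x grid, 20 points
true_w1 = 2.3
trials = 2000

hits = 0
for _ in range(trials):
    ys = [true_w1 * x + 1 + rng.gauss(0, 1) for x in xs]
    _, w1_hat, _, se_w1 = fit_with_se(xs, ys)
    # Does the +-2 SE interval around the estimate contain the true slope?
    if w1_hat - 2 * se_w1 <= true_w1 <= w1_hat + 2 * se_w1:
        hits += 1

coverage = hits / trials
print(f"fraction of intervals containing w1: {coverage:.3f}")
```

With only 20 points per sample the fraction is expected to sit slightly below $95\%$, which is why the $\pm2\text{SE}$ rule is described as approximate.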

“Is there a relationship between $x$ and $y$?”

To answer this question, we introduce the ‘hypothesis test’. We then have two hypotheses:

$$H_0\text{: There is no relationship between } x \text{ and } y$$

versus the alternative hypothesis:

$$H_a\text{: There is some relationship between } x \text{ and } y$$

The equivalent mathematical question is “whether $w_1$ is far from $0$ or not”, because according to equation (2), when $w_1=0$ we have:

$$y=w_0+\varepsilon$$

and $y$ is a random variable with mean $w_0$ that has no relationship with $x$ at all.

We do not know the actual value of $w_1$; we can only measure the uncertain $w_1$ through $\hat{w_1}$ and its $\text{SE}(\hat{w_1})$. From this view, if $\hat{w_1}$ is far away from $0$, measured in units of $\text{SE}$, we can have some confidence that $w_1$ is not $0$. Generally, $\hat{w_1}$ and $\text{SE}(\hat{w_1})$ combine as in the table:

| | $\hat{w_1}$ is close to $0$ | $\hat{w_1}$ is far from $0$ |
|---|---|---|
| $\text{SE}(\hat{w_1})$ is large | $w_1$ is more likely to be $0$ | $w_1$ could be $0$ or not |
| $\text{SE}(\hat{w_1})$ is tiny | $w_1$ could be $0$ or not | $w_1$ is less likely to be $0$ |

In the figures, we assume $w_1$ has a bell-shaped distribution with mean $\hat{w_1}$, whose variance is determined by $\text{SE}(\hat{w_1})$

This is the first column of the table. When $\hat{w_1}=0.1$, a tiny $\text{SE}$ (the orange line) makes it very probable that $w_1$ is not equal to $0$.

In contrast, this is the second column of the table. When $\hat{w_1}=1$, a large $\text{SE}$ (the red line) can do nothing to guarantee that $w_1$ is not $0$.


So $\hat{w_1}$ and $\text{SE}$ should be combined into a new quantity that indicates the chance that $w_1=0$. For this, we present the “t-statistic”:

$$t=\frac{\hat{w_1}-0}{\text{SE}(\hat{w_1})} \tag{11}$$

This is called a t-statistic because, when $w_1=0$, $t$ has a $t$-distribution with $n-2$ degrees of freedom. Equation (11) can be read as “$\hat{w_1}$ is $t$ times $\text{SE}(\hat{w_1})$ away from $0$”. It is a reliable distance because it is relative rather than absolute.

The bigger $|t|$ is, the more unlikely $w_1=0$ is.
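A small sketch of the t-statistic, computed as $\hat{w_1}$ divided by its estimated standard error, may make the idea concrete. It compares one data set generated with slope $2.3$ against one generated with slope $0$. The function name `t_statistic`, the data, and the noise level are assumptions for illustration only.

```python
import math
import random

def t_statistic(xs, ys):
    """t = w1_hat / SE(w1_hat), with delta estimated from the residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1_hat = sxy / sxx
    w0_hat = y_bar - w1_hat * x_bar
    rss = sum((y - w0_hat - w1_hat * x) ** 2 for x, y in zip(xs, ys))
    se_w1 = math.sqrt(rss / (n - 2)) / math.sqrt(sxx)
    return w1_hat / se_w1

rng = random.Random(5)               # arbitrary fixed seed
xs = [i / 2 for i in range(20)]      # assumed fixed x grid

# y depends on x (slope 2.3) versus y does not depend on x (slope 0).
ys_related = [2.3 * x + 1 + rng.gauss(0, 1) for x in xs]
ys_unrelated = [1 + rng.gauss(0, 1) for _ in xs]

t_related = t_statistic(xs, ys_related)
t_unrelated = t_statistic(xs, ys_unrelated)
print(f"t with a real slope: {t_related:.2f}")
print(f"t with no slope:     {t_unrelated:.2f}")
```

The first value should be many standard errors away from $0$, while the second should stay within a few standard errors of it.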


Another derived measurement is the ‘p-value’, a famous figure in statistics. Here it is the probability of observing a value larger than $|t|$ in absolute value when $w_1=0$. In the figure below, it is the area of the shaded region:

and it can also be written as:

$$\text{p-value}=P(|T|\ge|t|),\quad T\sim t_{n-2}$$

According to the relationship between $t$ and the distance from $\hat{w_1}$ to $0$, we have:

  • The bigger $|t|$, the greater the distance from $\hat{w_1}$ to $0$
  • The smaller the $\text{p-value}$, the greater the distance from $\hat{w_1}$ to $0$
  • The greater the distance from $\hat{w_1}$ to $0$, the stronger the evidence of a relationship between $X$ and $y$
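The link between $|t|$ and the p-value can be sketched numerically. The code below assumes the two-sided definition, that is, the probability that a $t$-distributed variable with $n-2$ degrees of freedom exceeds $|t|$ in absolute value, and evaluates it by integrating the $t$ density with Simpson's rule. The function names and the step count are hypothetical, and a statistics library would normally be used instead of this hand-written integration.

```python
import math

def t_pdf(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * central)   # clamp tiny negative rounding errors

df = 18                                # n - 2 for a sample of 20 points
for t in (0.5, 1.0, 2.0, 3.0, 5.0):
    print(f"|t| = {t:.1f} -> p-value = {two_sided_p_value(t, df):.5f}")
```

The printed values should decrease as $|t|$ grows, matching the first two bullets of the list above.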


  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.