Keywords: least squared error, linear regression
How much we can trust the parameters of a model is always a concern. To use the methods discussed previously with more confidence, we would like to build a reliable framework under which the method is always feasible.
We have introduced a naive method for solving a linear regression problem: the relation between the TV advertising budget and sales. Because the problem involves probability, it is a problem about uncertainty, so the answer to this sort of problem is uncertain as well. We therefore have to gather evidence to support our result and to answer unfriendly questions like this:
- “Are you sure your decision is correct?”
How to obtain a percentage such as “95.89%”, and how to build evidence to support that number, is what we are going to talk about today.
Before we start this long mathematical article, one fundamental assumption has to be accepted. The data we deal with and all the problems we will meet share a common model:
$$Y = f(X) + \varepsilon \tag{1}$$
where $Y$ is produced by an unknown system $f$ but can be observed, while $f$ and $\varepsilon$ are both unknown or only partly known. $\varepsilon$ is a random variable with mean zero; it can have any distribution, though the normal distribution is usually used because of its simplicity.
Why is this odd-looking equation our essential condition? The philosophy behind it is our belief in mathematics. We trust that the world runs on mathematics, that is, on a function $f$ that drives everything around us. However, $f$ exists in an unknown form and we may never know what it is; we can only use some model to approximate it as closely as we want. This is the basic idea of statistics. On the other hand, $\varepsilon$ is the mistake we make during the whole procedure. An easy example is measuring the length of a pen: your tool, the ruler, may not be precise, or you may write down the wrong number. Mistakes of this kind are always there and we can never get rid of them, so we use a random variable to make the model more practical.
A small conclusion: $f$ is the rule behind the world, and it is unknown. $\varepsilon$ is the error made during our observation or recording, and it is regarded as a mean-zero random variable.
In ‘Simple Linear Regression’ we solved the problem with a very popular method (almost every machine learning course puts this topic at the beginning), but we would like to go deeper. The first question we come across is how to analyze the estimates of the parameters.
Substituting our linear regression equation into equation (1), we get the model:
$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{2}$$
The parameters have names:
- $\beta_0$ is the intercept: when we set $X = 0$, the expected value of $Y$ is $\beta_0$
- $\beta_1$ is called the slope, and it represents the average increase in $Y$ associated with a one-unit increase in $X$
$\varepsilon$, however, does not have a special name; it catches all the noise in the model, for example:
- the true relationship behind the data may not be linear
- the measurements may contain errors
$\varepsilon$ plays an important part in linear regression, which brings us to the second assumption:
$\varepsilon$ is independent of $X$
However, this assumption is clearly incorrect in the “TV-Sales” problem. We can see that directly in the linear regression figure:
If we take the error to be the distance between a sample point and the line, we can easily see that the error grows as $X$ (TV) increases. However, we still keep this assumption, because it makes our regression analysis easier.
If equation (2) is the exact model of our observations, the line is called the ‘population regression line’. As we said, this equation is not always known, so we can only use a regression line to approximate it. Among such regression lines, the least-squares line is one of the best linear approximations to the true relationship.
Now we take
as our population regression line, in which $\varepsilon$ has a zero-mean Gaussian distribution. Then 5 samples are generated, each containing 20 points ($X$ is a random variable, but we can hold it fixed without loss of generality). Each sample is used to fit a line by the least squares method. We draw all of them in a single figure: the red line (Generator) is the population regression line and the dashed lines are the 5 least-squares lines:
This procedure can be repeated again and again, and a least-squares line will essentially never coincide exactly with the population regression line (the estimates can be modeled as continuous random variables, so the probability of an exact match is $0$). But the expectations of $\hat{\beta}_0$ and $\hat{\beta}_1$ equal $\beta_0$ and $\beta_1$ if we can draw infinitely many samples. In other words:
$$E[\hat{\beta}_0] = \beta_0, \qquad E[\hat{\beta}_1] = \beta_1 \tag{4}$$
This is true because the least squares line gives an unbiased estimate of the population parameters. Bias and unbiasedness are an interesting topic in statistics.
In short, if the expectation of an estimated parameter equals the true coefficient in equation (2), the estimate is called unbiased. Unbiased does not mean the estimate is better than a biased one, but it has some good properties that a biased one does not.
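The repeated-sampling experiment described above can be sketched in a few lines of standard-library Python. The coefficients `TRUE_B0 = 2`, `TRUE_B1 = 3`, the noise level, and the helper names are assumptions made for this illustration, not the values used for the figure.

```python
import random
import statistics


def fit_line(xs, ys):
    """Return the least-squares (intercept, slope) for paired data."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical population regression line (an assumption for this sketch).
TRUE_B0, TRUE_B1, NOISE_SD = 2.0, 3.0, 1.0


def average_estimates(n_samples, n_points=20, seed=0):
    """Fit many samples from the same population and average the estimates."""
    rng = random.Random(seed)
    # X is held fixed across samples; only the noise changes.
    xs = [-2.0 + 4.0 * i / (n_points - 1) for i in range(n_points)]
    intercepts, slopes = [], []
    for _ in range(n_samples):
        ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0.0, NOISE_SD) for x in xs]
        b0, b1 = fit_line(xs, ys)
        intercepts.append(b0)
        slopes.append(b1)
    return statistics.fmean(intercepts), statistics.fmean(slopes)


mean_b0, mean_b1 = average_estimates(2000)
print(f"average intercept: {mean_b0:.3f}, average slope: {mean_b1:.3f}")
```

Any single fitted line misses the population line, but the averages over many samples should land close to the assumed true coefficients, which is what unbiasedness claims.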
The standard error is one of the most useful statistics in parameter estimation. Variance is an essential numerical feature of a random variable, of a distribution, and also of a set of data, and the standard deviation is the square root of the variance. The standard deviation of an estimate computed from a sample is called its standard error. If we want to estimate $\mu$, the mean of $Y$, we can use the sample mean $\hat{\mu} = \bar{y}$ and then analyze the variance of $\hat{\mu}$. That is:
$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \tag{5}$$
where $\sigma$ is the standard deviation of each point $y_i$ of the sample. This is a little confusing, so we go back to equation (3), where each $y_i$ is a random variable with the same distribution as $Y$, and the randomness of $Y$ comes from $\varepsilon$, not from $X$. That means $Y$ has a Gaussian distribution with mean $\mu$, and $\sigma$ is the square root of its variance. For now we treat all the points in the sample as independent, so:
$$\mathrm{Var}(\hat{\mu}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(y_i) = \frac{\sigma^2}{n} \tag{6}$$
Equation (5) is a very famous equation, and we can draw the following conclusions from it:
- as $n$ increases, $\mathrm{SE}(\hat{\mu})$ decreases, so $\hat{\mu}$ becomes more and more certain.
- $\mathrm{SE}(\hat{\mu})$ is a good measure of how far $\hat{\mu}$ typically is from the actual $\mu$
One sentence about SE: the smaller the SE, the less the uncertainty.
There is a hint here: we should separate a sample point from a realization of it. A sample point is a random variable with the same distribution as the population, and this is why we can use a sample to estimate parameters of the population distribution. SE plays an indispensable part in the following sections.
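The claim that the standard error of a sample mean shrinks like $\sigma/\sqrt{n}$ can be checked by simulation. The sketch below assumes Gaussian data with hypothetical values `mu = 5`, `sigma = 2`; the function names are made up for this illustration.

```python
import math
import random
import statistics


def standard_error_of_mean(sigma, n):
    """Theoretical standard error of the mean of n independent points."""
    return sigma / math.sqrt(n)


def empirical_se_of_mean(mu, sigma, n, n_samples=4000, seed=0):
    """Standard deviation of the sample mean over many simulated samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


for n in (5, 25, 100):
    theory = standard_error_of_mean(2.0, n)
    observed = empirical_se_of_mean(5.0, 2.0, n)
    print(f"n={n:3d}  theory={theory:.3f}  simulated={observed:.3f}")
```

The simulated spread of the sample means should track the theoretical value and drop as `n` grows.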
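Now we go back to the little example in the last section about the mean of $Y$. This $Y$ can just as well be the $Y$ in equation (3) or in equation (2). Least squares gave us a solution of equation (2) in the post ‘Simple Linear Regression’:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{7}$$
where $\bar{y}$ has the same meaning as $\hat{\mu}$ (Notation: $\mu$ is the mean of the population and $\bar{y}$ is the mean of the sample, but here we treat them as the same thing). Substituting equation (7) into equation (6), we get the $\mathrm{SE}$ of $\hat{\beta}_0$ and $\hat{\beta}_1$:
$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{8}$$
and $\sigma^2 = \mathrm{Var}(\varepsilon)$. This derivation is a little complicated; in the multi-variable case it can be derived more easily. From equation (8), these features can be read off: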
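- When $\bar{x} = 0$, $\mathrm{SE}(\hat{\beta}_0)^2$ has the same value as the variance of $\hat{\mu}$
- As $n$ grows, the squared $\mathrm{SE}$s of $\hat{\beta}_0$ and $\hat{\beta}_1$ shrink, and both $\hat{\beta}_0$ and $\hat{\beta}_1$ become more certain.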
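As a rough illustration of the two observations above, the standard-error formulas for the intercept and slope can be evaluated directly. This sketch assumes the noise standard deviation `sigma` is known, which is rarely true in practice; the function name is hypothetical.

```python
import math
import statistics


def coefficient_standard_errors(xs, sigma):
    """Return (SE of intercept, SE of slope) for fixed xs and known sigma."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    se_b0 = sigma * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_b1 = sigma / math.sqrt(sxx)
    return se_b0, se_b1


# Centered xs (mean 0): the intercept SE reduces to sigma / sqrt(n).
print(coefficient_standard_errors([-1.0, 0.0, 1.0], sigma=1.0))

# The same spacing far from the origin: the slope SE is unchanged,
# while the intercept SE grows because x_bar is no longer 0.
print(coefficient_standard_errors([9.0, 10.0, 11.0], sigma=1.0))
```

Repeating the same x-values several times (a larger sample) increases the sum of squared deviations, so both standard errors shrink.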
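Both $\mathrm{SE}$s contain $\sigma$, and $\sigma$ is a key numerical property of the population distribution, so we would like to estimate $\sigma$ from the sample. This estimate is known as the residual standard error, and is given by the formula:
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}, \qquad \mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2 \tag{9}$$
Since $\sigma$ is always unknown and is estimated from the data, $\mathrm{SE}$ should be written as $\widehat{\mathrm{SE}}$, but for simplicity of notation we will write $\mathrm{SE}$.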
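A minimal sketch of the residual standard error: fit the least-squares line, sum the squared residuals, divide by $n - 2$, and take the square root. The tiny data set and the function name are made up for this illustration.

```python
import math
import statistics


def residual_standard_error(xs, ys):
    """Estimate the noise standard deviation from least-squares residuals."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares around the fitted line.
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Two parameters were estimated, hence n - 2 in the denominator.
    return math.sqrt(rss / (n - 2))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 4.0]
print(f"RSE = {residual_standard_error(xs, ys):.4f}")
```

For this toy data the fitted line is $y = -0.2 + 1.3x$ and the residual sum of squares is $0.30$, so the RSE is $\sqrt{0.15}$.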
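From the confidence intervals of $\beta_1$ and $\beta_0$, there is approximately a $95\%$ chance that the intervals:
$$\left[\hat{\beta}_1 - 2\,\mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2\,\mathrm{SE}(\hat{\beta}_1)\right], \qquad \left[\hat{\beta}_0 - 2\,\mathrm{SE}(\hat{\beta}_0),\ \hat{\beta}_0 + 2\,\mathrm{SE}(\hat{\beta}_0)\right] \tag{10}$$
will contain $\beta_1$ and $\beta_0$, respectively (where $95\%$ is not a precise probability for an interval of $\pm 2\,\mathrm{SE}$).
The question “Are you sure your decision is correct?” can be answered now. In the interval $\left[\hat{\beta}_1 - 2\,\mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2\,\mathrm{SE}(\hat{\beta}_1)\right]$, we have roughly a $95\%$ chance of capturing the actual $\beta_1$.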
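How often an interval of plus or minus two standard errors actually captures the true slope can be estimated by simulation. The sketch below assumes a hypothetical population line with intercept 2, Gaussian noise, and 20 fixed x-values; all names are made up for this illustration.

```python
import math
import random
import statistics


def slope_and_se(xs, ys):
    """Return the least-squares slope and its estimated standard error."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))  # estimate of sigma
    return b1, rse / math.sqrt(sxx)


def interval(estimate, se, width=2.0):
    """Interval of `width` standard errors around the estimate."""
    return estimate - width * se, estimate + width * se


def coverage(true_b1, n_samples=2000, n_points=20, seed=1):
    """Fraction of simulated intervals that contain the true slope."""
    rng = random.Random(seed)
    xs = [-2.0 + 4.0 * i / (n_points - 1) for i in range(n_points)]
    hits = 0
    for _ in range(n_samples):
        # Intercept 2.0 and unit noise are assumptions of this sketch.
        ys = [2.0 + true_b1 * x + rng.gauss(0.0, 1.0) for x in xs]
        low, high = interval(*slope_and_se(xs, ys))
        hits += low <= true_b1 <= high
    return hits / n_samples


print(f"coverage of the 2*SE interval: {coverage(3.0):.3f}")
```

With only 20 points the coverage tends to come out slightly below 95%, which is one way to see that the two-standard-error rule is an approximation rather than an exact probability.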
To answer this question more formally, we introduce the ‘hypothesis test’. We have two hypotheses, the null hypothesis:
$$H_0: \beta_1 = 0$$
versus the alternative hypothesis:
$$H_a: \beta_1 \neq 0$$
The equivalent mathematical question is “whether $\hat{\beta}_1$ is far from $0$ or not”, because according to equation (2), when $\beta_1 = 0$ we have:
$$Y = \beta_0 + \varepsilon$$
where $\varepsilon$ is a random variable with mean $0$, so $Y$ has no relationship with $X$ at all.
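We do not know the actual value of $\beta_1$; we can only gauge the uncertainty through $\hat{\beta}_1$ and its $\mathrm{SE}$. From this point of view, if $\hat{\beta}_1$ is far away from $0$, with distance measured in units of $\mathrm{SE}$, we can have some confidence about whether $\beta_1$ is $0$. Generally, $\hat{\beta}_1$ and $\mathrm{SE}(\hat{\beta}_1)$ combine as in this table:
| | $\hat{\beta}_1$ is close to $0$ | $\hat{\beta}_1$ is far from $0$ |
|---|---|---|
| $\mathrm{SE}(\hat{\beta}_1)$ is large | more likely $\beta_1$ is $0$ | likely to be $0$ or not |
| $\mathrm{SE}(\hat{\beta}_1)$ is tiny | likely to be $0$ or not | less likely to be $0$ |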
In the figures, we assume $\hat{\beta}_1$ has a bell-shaped distribution with mean $\beta_1$, whose variance is determined by $\mathrm{SE}(\hat{\beta}_1)$.
This is the first column of the table. A tiny $\mathrm{SE}$ (the origin line) makes it very probable that $\beta_1$ is not equal to $0$.
In contrast, this is the second column of the table. A large $\mathrm{SE}$ (the red line) can do nothing to guarantee that $\beta_1$ is not $0$.
So $\hat{\beta}_1$ and $\mathrm{SE}(\hat{\beta}_1)$ should be combined into a new quantity that indicates how plausible $\beta_1 = 0$ is. For this, we present the “t-statistic”:
$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)} \tag{11}$$
This is called a t-statistic because, when $\beta_1 = 0$, $t$ has a $t$-distribution with $n - 2$ degrees of freedom. Equation (11) can be read as “$\hat{\beta}_1$ is $t$ standard errors away from $0$”, and it is a reliable distance because it is relative rather than absolute.
The bigger $|t|$ is, the more unlikely $\beta_1 = 0$ is.
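The t-statistic for the slope can be computed from the pieces introduced so far: the least-squares slope, the residual standard error, and the slope's standard error. This is a minimal sketch on a made-up four-point data set; the function name is hypothetical.

```python
import math
import statistics


def slope_t_statistic(xs, ys):
    """Return the slope estimate divided by its estimated standard error."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))   # estimate of sigma
    se_b1 = rse / math.sqrt(sxx)     # standard error of the slope
    return b1 / se_b1                # distance from 0 in units of SE


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 4.0]
print(f"t = {slope_t_statistic(xs, ys):.3f}")
```

Here the slope is $1.3$ with a standard error of $\sqrt{0.03}$, so the slope sits about 7.5 standard errors away from zero.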
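Another derived measure is the ‘p-value’, a famous figure in statistics. Here it is the probability, assuming $\beta_1 = 0$, of observing a value at least as large as $|t|$ in absolute value. In the figure below, it is the area of the shaded region:
and it can also be written as:
$$p = P\left(|T| > |t|\right) \tag{12}$$
where $T$ is a random variable with the $t$-distribution with $n - 2$ degrees of freedom.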
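According to the relationship between $t$ and the distance from $\hat{\beta}_1$ to $0$, we have:
- The bigger $|t|$, the greater the distance from $\hat{\beta}_1$ to $0$
- The smaller $p$, the greater the distance from $\hat{\beta}_1$ to $0$
- The greater the distance from $\hat{\beta}_1$ to $0$, the stronger the evidence of a relationship between $X$ and $Y$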
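To make the link between the t-statistic and the p-value concrete, the tail area of the t-distribution can be computed numerically. The sketch below integrates the t density with Simpson's rule using only the standard library; it is an illustrative approximation with hypothetical function names, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_sided_p_value(t, dof, steps=2000):
    """Approximate P(|T| > |t|) with Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, dof)
    central_half = total * h / 3          # area between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_half)


# With 18 degrees of freedom (20 points, 2 fitted parameters),
# a larger |t| gives a smaller p-value.
for t in (0.5, 2.0, 4.0):
    print(f"t = {t:.1f}  p = {two_sided_p_value(t, 18):.4f}")
```

The density is symmetric about zero, so the two-sided tail area is one minus twice the area between 0 and $|t|$.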
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.