Keywords: linear regression
Accuracy of the Model
An assumption about the model behind the observed data is
$$Y = f(X) + \epsilon \tag{1}$$
which means there is an actual generator $f$ behind the data, and some noise $\epsilon$ was added along the way. This is a general model for most regression problems. In a linear regression task, it specializes to:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
In the post 'simple linear regression problems', we mentioned a performance measure of the linear model called $\mathrm{RSS}$ (Residual Sum of Squares):
$$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
It has a strong connection with the performance of the linear model, since it increases when the linear model fits the data worse, and it is used to direct the parameter search in the learning phase. This kind of measure might be regarded as one way to define the accuracy of the model:
- Lower $\mathrm{RSS}$, better model
- Higher $\mathrm{RSS}$, worse model
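As a quick numerical sketch (the data here is made up, and `np.polyfit` stands in for the learning phase), $\mathrm{RSS}$ can be computed directly from the residuals, and the least-squares line attains a smaller $\mathrm{RSS}$ than any other line:

```python
import numpy as np

# Hypothetical sample: y depends roughly linearly on x, plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit y_hat = b0 + b1 * x by least squares.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# RSS = sum of squared residuals; smaller means a better fit.
rss = np.sum((y - y_hat) ** 2)

# A deliberately worse line (slope 1, intercept 1) gives a larger RSS.
bad_rss = np.sum((y - (1.0 + 1.0 * x)) ** 2)
print(rss < bad_rss)  # True: least squares minimizes RSS
```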
Besides $\mathrm{RSS}$, in this post we take up another two numerical measures of the accuracy of the linear model: $\mathrm{RSE}$ and $R^2$.
$\mathrm{RSS}$ is a good tool to assess the accuracy of a model, but it also has deficiencies. $\mathrm{RSE}$ (Residual Standard Error) is derived from $\mathrm{RSS}$:
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}$$
The factor $n-2$ comes from the fact that there are 2 parameters in our simple linear regression model. If there are $p$ parameters in our model, this factor changes to $n-p$:
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-p}}$$
Why is $\mathrm{RSE}$ a better choice than $\mathrm{RSS}$?
One of the assumptions of equation (1) is that $\epsilon$ is a random variable that has a Gaussian distribution with mean $0$ and variance $\sigma^2$. Then every $y_i$ also has a Gaussian distribution, with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. In other words:
$$y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\; \sigma^2)$$
This is the essential observation for estimating the real $\sigma$. All the $\epsilon_i$'s are i.i.d., which means they share the same $\sigma$, so $\mathrm{RSS}$ can serve as a statistic for estimating $\sigma$. To make the estimate of the standard deviation $\sigma$ unbiased, $\sqrt{\mathrm{RSS}/n}$ is converted to $\sqrt{\mathrm{RSS}/(n-2)}$.
In other words, when we have the actual parameters, $\mathrm{RSE}$ can be used as an unbiased estimate of the standard deviation of $\epsilon$ in equation (1) ($\mathrm{RSS}/\sigma^2$ has a $\chi^2_{n-2}$ distribution). But when we don't have the correct parameters, $\mathrm{RSE}$ plays the same role as $\mathrm{RSS}$.
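A small simulation illustrates why the $n-2$ denominator matters (the true parameters $\beta_0 = 1$, $\beta_1 = 0.5$ and noise level $\sigma = 2$ are assumptions of the sketch): averaged over many synthetic datasets, $\mathrm{RSS}/(n-2)$ recovers the true $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 30, 2.0           # sample size and true noise level (assumed)
b0_true, b1_true = 1.0, 0.5  # hypothetical true parameters

estimates = []
for _ in range(2000):
    x = rng.uniform(0, 10, size=n)
    y = b0_true + b1_true * x + rng.normal(0, sigma, size=n)
    b1, b0 = np.polyfit(x, y, deg=1)
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    estimates.append(rss / (n - 2))   # RSE**2 with the n-2 correction

# Averaged over many simulations, RSS/(n-2) is close to sigma**2 = 4.0.
avg = float(np.mean(estimates))
print(round(avg, 2))
```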
$\mathrm{RSE}$ measures the lack of fit:
- when $\hat{y}_i \approx y_i$, we have a small $\mathrm{RSE}$ – a good fit
- when $\hat{y}_i$ is far from $y_i$, we have a big $\mathrm{RSE}$ – a bad fit
Consider the following situation: if the values of $y$ are in the millions, the $\mathrm{RSE}$ of our model may be bigger than 1 million. Another model of the same data, but with the response rescaled, may have an $\mathrm{RSE}$ that lies below 100. Which model is better cannot be told just by $\mathrm{RSE}$. We need a relative measure, not an absolute one.
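The scale problem can be demonstrated in a few lines (synthetic data; the rescaling factor $10^4$ is an arbitrary choice): the fit is identical in both cases, yet the $\mathrm{RSE}$ shrinks by exactly the rescaling factor:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3e6 * x + rng.normal(0, 1e6, size=50)   # responses in the millions

def rse(x, y):
    # RSE = sqrt(RSS / (n - 2)) for a least-squares line fit
    b1, b0 = np.polyfit(x, y, deg=1)
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    return np.sqrt(rss / (len(y) - 2))

# Same data, same relationship -- but rescaling y shrinks the RSE,
# so the absolute value of RSE cannot rank the two models.
print(rse(x, y))          # on the order of 1e6
print(rse(x, y / 1e4))    # on the order of 1e2
```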
A traditional way of changing an absolute measure into a relative one is building a proportion. So we get:
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
where $\mathrm{TSS}$ is the total sum of squares $\sum_{i=1}^{n}(y_i - \bar{y})^2$. To visualize $\mathrm{TSS}$, we can draw $\bar{y}$ as a line through the data:
where the gray lines are the deviations of each point; their squared lengths sum up to $\mathrm{TSS}$. After the linear fitting, we get $\mathrm{RSS}$:
$\mathrm{TSS}$ is always greater than or equal to $\mathrm{RSS}$:
$$\mathrm{TSS} \geq \mathrm{RSS}$$
where equality holds only when the fitted line is $\hat{y} = \bar{y}$, and $\mathrm{TSS} - \mathrm{RSS}$ is the reduction of uncertainty after fitting the model.
This gives us the following conclusion:
- When $\mathrm{RSS} = 0$, $R^2 = 1$; that means a perfect fit, and the model explains everything
- When $\mathrm{RSS} = \mathrm{TSS}$, $R^2 = 0$; this means the fitting does nothing for prediction, because its prediction is just the same as the mean of the sample
So $R^2$ is a good measurement of the accuracy of the model.
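A sketch of the two extremes (the datasets are invented for illustration): $R^2$ computed as $1 - \mathrm{RSS}/\mathrm{TSS}$ equals 1 on an exact line, stays near 1 with mild noise, and drops toward 0 on pure noise:

```python
import numpy as np

def r_squared(x, y):
    # R^2 = 1 - RSS/TSS for a least-squares line fit
    b1, b0 = np.polyfit(x, y, deg=1)
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

x = np.linspace(0, 10, 50)
rng = np.random.default_rng(2)

exact = r_squared(x, 2 * x + 1)                      # perfect fit: R^2 = 1
noisy = r_squared(x, 2 * x + rng.normal(0, 1, 50))   # strong linear signal
noise = r_squared(x, rng.normal(0, 1, 50))           # no relationship at all
print(exact, noisy, noise)
```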
The usage of $R^2$ is not just as an indicator of how good a model is; it can be used in other ways as well. The first one, for instance: in physics, we may know that $X$ and $Y$ have a linear relationship, but the $R^2$ after regression is close to 0. This tells us the data has too much noise, or even that the experiment is totally wrong.
The second example is using $R^2$ to test whether the sample has a linear relationship. What we need to do is run the linear regression and then check its $R^2$: if $R^2 \approx 0$, the variables are not linearly related.
We test whether the linear relationship is strong by $R^2$; this is a little like testing whether there is some relationship between two variables by their covariance. Correlation is a rescaled form of the covariance of $X$ and $Y$:
$$\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
and in simple linear regression $R^2 = \mathrm{Cor}(X, Y)^2$. However, we use $R^2$ instead of correlation because $R^2$ works more easily in the multi-variable case.
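That connection can be checked numerically (synthetic data again): the $R^2$ from the regression definition matches the squared sample correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.5 * x + rng.normal(0, 2.0, size=100)

# R^2 from the regression definition: 1 - RSS/TSS.
b1, b0 = np.polyfit(x, y, deg=1)
rss = np.sum((y - (b0 + b1 * x)) ** 2)
tss = np.sum((y - np.mean(y)) ** 2)
r2 = 1.0 - rss / tss

# Squared sample correlation of x and y.
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r2, r ** 2))  # True: they agree in simple linear regression
```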
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013. ↩︎