Assessing the Accuracy of the Model

Keywords: linear regression

Accuracy of the Model[1]

A basic assumption about the model behind the observed data is

Y=f(X)+\varepsilon\tag{1}

which means there is an actual generating process behind the data, with some noise added on top. This is a general model for most regression problems. For linear regression, it specializes to:

Y=w_1X+w_0+\varepsilon\tag{2}
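To make this assumption concrete, here is a minimal Python sketch of the assumed data-generating process. The parameter values w1_true = 2.0, w0_true = 1.0 and the noise scale are made up purely for illustration; the post itself does not fix any numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" parameters of the generator in equation (2)
w1_true, w0_true = 2.0, 1.0
sigma = 0.5                                      # standard deviation of the Gaussian noise

n = 50
x = rng.uniform(0.0, 10.0, size=n)               # observed inputs X_i
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # the epsilon term
y = w1_true * x + w0_true + eps                  # Y = w1*X + w0 + epsilon
```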

In the post ‘simple linear regression problems’, we mentioned a performance measure of the linear model called RSS (Residual Sum of Squares). It has a strong connection with the performance of the linear model: it increases as the linear model fits the data worse, and it is used to guide the parameter search during the learning phase. This kind of performance measure can be regarded as one way to define the accuracy of the model (see the sketch after the list below):

  • Lower RSS, better model
  • Higher RSS, worse model
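As a small sketch of this idea (the data and the helper function rss below are made up for illustration): RSS is just the sum of squared differences between the observed y_i and the model's predictions ŷ_i, so a line that passes closer to the points gets a lower value.

```python
import numpy as np

def rss(y, y_hat):
    """Residual Sum of Squares: sum of (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sum((y - y_hat) ** 2))

# Tiny made-up example: the same observations scored against two candidate lines
y          = np.array([1.0, 2.1, 2.9, 4.2])
y_hat_good = np.array([1.1, 2.0, 3.0, 4.0])   # close to the data  -> small RSS
y_hat_bad  = np.array([0.0, 1.0, 2.0, 3.0])   # systematically off -> large RSS

print(rss(y, y_hat_good))   # ≈ 0.07
print(rss(y, y_hat_bad))    # ≈ 4.46
```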

Besides RSS, this post introduces two more numerical measures of the accuracy of a linear model:

  1. RSE
  2. R²

Residual Standard Error (RSE)

RSS is a useful tool for assessing the accuracy of a model, but it has a deficiency. RSE is derived from RSS:

\text{RSE}=\sqrt{\frac{1}{n-2}\text{RSS}}=\sqrt{\frac{1}{n-2}\sum^n_{i=1}(y_i-\hat{y_i})^2}\tag{3}

The factor 2 in n-2 comes from the fact that there are 2 parameters in our simple linear regression model. If there are m parameters in the model, the factor becomes n-m:

\text{RSE}=\sqrt{\frac{1}{n-m}\text{RSS}}=\sqrt{\frac{1}{n-m}\sum^n_{i=1}(y_i-\hat{y_i})^2}\tag{4}
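A sketch of equation (4) in code (the helper rse and the simulated data are illustrative only): RSE divides RSS by the degrees of freedom n - m before taking the square root, with m = 2 for simple linear regression.

```python
import numpy as np

def rse(y, y_hat, m=2):
    """Residual Standard Error: sqrt(RSS / (n - m)), where m is the number of fitted parameters."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return float(np.sqrt(rss / (n - m)))

# Illustrative check: on data simulated with noise scale 0.5, RSE should land near 0.5
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
w1, w0 = np.polyfit(x, y, deg=1)      # least-squares estimates of the two parameters
print(rse(y, w1 * x + w0, m=2))       # ≈ 0.5
```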

Why is RSE a better choice than RSS?

One of the assumptions behind equation (1) is that ε is a random variable with a Gaussian distribution with mean 0. Then every Y_i also has a Gaussian distribution, with mean w_1 X_i + w_0 and variance δ². In other words, each squared residual gives one observation related to this variance:

\delta_i^2=(\hat{y_i}-y_i)^2

These are the essential observations for estimating the real δ². The noise terms ε_i are i.i.d., which means every Y_i has the same δ², so RSS can serve as a statistic for estimating δ². To turn it into an unbiased estimate of the noise variance δ² (and hence an estimate of the standard deviation δ), RSS is converted to RSE.

In other words, when the fitted parameters are close to the actual ones, RSE can be used as an estimate of the standard deviation δ in equation (1) (the scaled residual sum of squares RSS/δ² follows a χ² distribution). But when we do not have the correct parameters, RSE behaves just like RSS: it simply grows as the fit gets worse.
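The standard argument behind the χ² claim in the parenthesis above (a sketch, assuming the Gaussian noise model and a least-squares fit with 2 parameters): the scaled residual sum of squares follows a χ² distribution with n - 2 degrees of freedom, which is exactly why dividing by n - 2 makes RSE² an unbiased estimator of δ²:

E\!\left[\frac{\text{RSS}}{\delta^2}\right]\;\text{where}\;\frac{\text{RSS}}{\delta^2}\sim\chi^2_{n-2}\quad\Rightarrow\quad E[\text{RSS}]=(n-2)\delta^2\quad\Rightarrow\quad E\!\left[\frac{\text{RSS}}{n-2}\right]=E[\text{RSE}^2]=\delta^2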

RSE measures the lack of fit:

  • when ŷ_i is close to y_i, we have a small RSE – good fit
  • when ŷ_i is far from y_i, we have a big RSE – bad fit

R²

Consider the following situation: if the values of X_i lie in [10^9, 10^10], the RSE of our model may be bigger than 1 million, while another model of the same data that uses a transformation such as log(X_i) may have an RSE below 100. Which model is better cannot be told from RSE alone: RSE is an absolute measure, and what we need is a relative one.

A traditional way of turning an absolute measure into a relative one is to build a proportion. So we get:

\text{R}^2=\frac{\text{TSS}-\text{RSS}}{\text{TSS}}=1-\frac{\text{RSS}}{\text{TSS}}

where TSS is the total sum of squares, Σ(y_i - ȳ)². To picture TSS, we can draw ȳ as a horizontal line through the data:

where the gray lines are the deviations of each point from ȳ; their squared lengths sum to TSS. After the linear fit, the residuals give RSS:

TSS is always greater than or equal to RSS:

\begin{aligned}
\text{TSS}&=\sum(y_i-\bar{y})^2\\
&=\sum(y_i-\hat{y_i}+\hat{y_i}-\bar{y})^2\\
&=\sum(y_i-\hat{y_i})^2+\sum(\hat{y_i}-\bar{y})^2+2\sum(y_i-\hat{y_i})(\hat{y_i}-\bar{y})\\
&=\text{RSS}+\sum(\hat{y_i}-\bar{y})^2\\
&\geq \text{RSS}\geq 0
\end{aligned}

where TSS = RSS only when ŷ_i = ȳ for all i, and TSS - RSS is the reduction of uncertainty gained by fitting the model.
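The step from the third line to the fourth deserves a note (a sketch of the standard argument, assuming w_1 and w_0 are the least-squares estimates): the normal equations force the residuals to sum to zero and to be orthogonal to the inputs, so the cross term vanishes:

\begin{aligned}
\sum(y_i-\hat{y_i})&=0,\qquad\sum(y_i-\hat{y_i})x_i=0\\
\Rightarrow\quad\sum(y_i-\hat{y_i})(\hat{y_i}-\bar{y})&=w_1\sum(y_i-\hat{y_i})x_i+(w_0-\bar{y})\sum(y_i-\hat{y_i})=0
\end{aligned}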

This gives us the following conclusions:

  1. 0 ≤ R² ≤ 1
  2. When R² = 1, RSS = 0, which means a perfect fit: the model explains everything.
  3. When R² = 0, RSS = TSS, which means the fit contributes nothing to prediction, because its predictions are no better than the sample mean.

So R² is a good measure of the accuracy of the model.
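As a sketch of why R² is the relative measure we were after (the helper fit_metrics and the simulated data below are made up for illustration): rescaling the responses rescales RSS, TSS, and RSE, but leaves R² unchanged, so R² can be compared across models whose outputs live on very different scales.

```python
import numpy as np

def fit_metrics(x, y):
    """Fit y ~ w1*x + w0 by least squares and return (RSE, R^2)."""
    w1, w0 = np.polyfit(x, y, deg=1)
    y_hat = w1 * x + w0
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(np.sqrt(rss / (len(y) - 2))), float(1.0 - rss / tss)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

print(fit_metrics(x, y))         # RSE near 1, R^2 close to 1
print(fit_metrics(x, 1e6 * y))   # RSE blows up by a factor of 1e6, R^2 is unchanged
```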

Usage of R²

R² is not only an indicator of how good a model is; it can also be used in other ways. For instance, suppose that in a physics experiment we know X and Y have a linear relationship, but the R² after regression is close to 0. This R² tells us that the data contain too much noise, or even that the experiment went completely wrong.

The second example is using R² to test whether the sample has a linear relationship: run a linear regression and then examine its R²; if R² is close to 0, the relationship is not linear.
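A small sketch of this second usage (the data and the helper r_squared are illustrative): for a strong but clearly non-linear relationship such as y = x² on a range symmetric around zero, the fitted line explains essentially nothing and R² comes out near 0.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line fitted to (x, y): 1 - RSS/TSS."""
    w1, w0 = np.polyfit(x, y, deg=1)
    y_hat = w1 * x + w0
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1.0 - rss / tss)

x = np.linspace(-5.0, 5.0, 101)
print(r_squared(x, 2.0 * x + 1.0))   # exactly linear   -> R^2 = 1.0
print(r_squared(x, x ** 2))          # non-linear in x  -> R^2 ≈ 0
```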

Cov(X, Y) and R²

Testing whether the linear relationship is strong by R² is a little like testing whether two variables are related at all by their covariance. In fact, for simple linear regression, R² equals the squared sample correlation between X and Y, which is a normalized (and squared) version of Cov(X, Y). However, we use R² rather than Cov(X, Y) because R² extends more easily to the multi-variable case.
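To state the connection precisely (a standard identity for simple linear regression with an intercept, added here as a supplement): R² equals the squared sample correlation between X and Y, i.e. the covariance normalized by both variances and then squared:

\text{R}^2=r_{XY}^2=\left(\frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}\right)^2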

References


  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.