Keywords: least squares
Least Squares for Classification
Least squares for linear regression was discussed in ‘Simple Linear Regression’. In this post, we investigate whether this powerful algorithm can also be used for classification.
Recalling the distinction between classification and regression (‘From Linear Regression to Linear Classification’), two points need to be emphasized again:
- the targets of regression are continuous, while the targets of classification are discrete.
- the output of a classification hypothesis can be generative (e.g., a posterior probability over classes) or discriminative (just a class label).
Generative models will be discussed in other posts. In these posts we focus on discriminative models, which means our hypothesis directly tells us which class the input belongs to.
We want to reuse the least-squares method that was designed and justified for linear regression. To extend it to classification, we need to:
- modify the type of output
- design a discriminative model
Modifying the type of output means converting the class label into a number, and we must then assume that this number is continuous. Using the 1-of-K labeling scheme, we can build the model with $K$ linear functions:

$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0} \tag{1}$$
where $\mathbf{w}_k$ is the weight vector and $w_{k0}$ the bias, for $k = 1, \dots, K$. And $y_k(\mathbf{x})$ is the $k$th component of the 1-of-K output for input $\mathbf{x}$. Clearly, the output of each $y_k(\mathbf{x})$ will generally not be exactly 0 or 1, so we set the largest component to 1 and all the others to 0.
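This largest-component decision rule can be sketched in a few lines; the output values below are made up for illustration:

```python
import numpy as np

# Continuous outputs of the K linear functions for one input
# (illustrative values; in practice they come from equation (1))
y = np.array([0.3, 1.2, -0.1])

# Set the largest component to 1 and the others to 0
decision = np.zeros_like(y)
decision[np.argmax(y)] = 1.0
print(decision)  # [0. 1. 0.]
```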
We have discussed linear regression with least squares in a ‘single-target’ regression problem. The same idea can also be employed in regression with multiple targets, and all the weight vectors can be calculated simultaneously. We can rewrite equation (1) in matrix form:

$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}} \tag{2}$$
where $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$ is the augmented input and the $k$th column of $\tilde{\mathbf{W}}$ is

$$\tilde{\mathbf{w}}_k = \begin{pmatrix} w_{k0} \\ \mathbf{w}_k \end{pmatrix} \tag{3}$$
Then we employ the least-squares method for a sample $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1, \dots, N$:

$$\mathbf{y}(\mathbf{x}_n) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}_n \approx \mathbf{t}_n \tag{4}$$
where $\mathbf{t}_n$ is a $K$-dimensional target vector consisting of 0’s and a single 1. Each dimension of the output $\mathbf{y}(\mathbf{x}_n)$ is the regression result for the corresponding dimension of the target $\mathbf{t}_n$. We build the input matrix $\tilde{\mathbf{X}}$ of all inputs, with the $\tilde{\mathbf{x}}_n^T$ as rows:

$$\tilde{\mathbf{X}} = \begin{pmatrix} \tilde{\mathbf{x}}_1^T \\ \vdots \\ \tilde{\mathbf{x}}_N^T \end{pmatrix}$$
The sum-of-squares error is:

$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2}\operatorname{Tr}\left\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\right\} \tag{5}$$
where the matrix $\mathbf{T}$ is the target matrix whose $n$th row is the target vector $\mathbf{t}_n^T$. The trace operation is employed because only the diagonal entries $(i, i)$ of $(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})$ are meaningful, while the off-diagonal entries $(i, j)$ with $i \neq j$ are useless.
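As a sanity check on equation (5): the trace form is exactly the sum of squared entries of $\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}$. A quick numerical sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # augmented inputs: N = 5 rows
W = rng.normal(size=(3, 2))  # weight matrix: K = 2 columns
T = rng.normal(size=(5, 2))  # target matrix: one row per sample

D = X @ W - T
trace_form = 0.5 * np.trace(D.T @ D)     # equation (5)
sum_form = 0.5 * np.sum(D ** 2)          # plain sum of squared errors
print(np.isclose(trace_form, sum_form))  # True
```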
To minimize equation (5), we compute its derivative with respect to $\tilde{\mathbf{W}}$:

$$\frac{\partial E_D(\tilde{\mathbf{W}})}{\partial \tilde{\mathbf{W}}} = \tilde{\mathbf{X}}^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}) \tag{6}$$
and set it equal to $\mathbf{0}$:

$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} \tag{7}$$
where we assume $\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$ can be inverted.
The component $(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T$ is also called the pseudo-inverse of the matrix $\tilde{\mathbf{X}}$, and it is usually denoted $\tilde{\mathbf{X}}^{\dagger}$.
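NumPy provides the pseudo-inverse directly as `np.linalg.pinv` (computed via the SVD), and it agrees with the explicit formula whenever $\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$ is invertible. A small sketch with a random full-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))  # tall matrix, full column rank

explicit = np.linalg.inv(X.T @ X) @ X.T  # (X^T X)^(-1) X^T
pinv = np.linalg.pinv(X)                 # pseudo-inverse via SVD
print(np.allclose(explicit, pinv))       # True
```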
The code for this algorithm is relatively simple because we have already programmed linear regression, which has the same form as equation (7).
What we should take care of is the formation of the matrices $\tilde{\mathbf{X}}$, $\tilde{\mathbf{W}}$, and $\mathbf{T}$.
We should first convert the target values into 1-of-K form:
`def label_convert(y, method='1-of-K'):`
What we do is count the total number of distinct labels ($K$), then set the $k$th component of the 1-of-K target to 1 (where $k$ is the index of the sample’s label) and all other components to 0.
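Here is a minimal sketch of such a conversion, assuming integer class labels; it is my own illustration and not necessarily identical to the repository’s `label_convert`:

```python
import numpy as np

def label_convert(y, method='1-of-K'):
    """Convert a 1-D array of class labels into 1-of-K target vectors."""
    if method != '1-of-K':
        raise ValueError('unknown method: ' + method)
    labels = np.unique(y)                # the K distinct labels, sorted
    t = np.zeros((len(y), labels.size))  # one row per sample
    for i, label in enumerate(y):
        k = np.searchsorted(labels, label)  # index k of this label
        t[i, k] = 1.0                       # k-th component set to 1
    return t

print(label_convert(np.array([0, 2, 1, 0])))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```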
The kernel of the algorithm is:
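A minimal sketch of this training step, assuming the samples are the rows of `x` and the 1-of-K targets the rows of `t` (my own sketch, not the exact repository code):

```python
import numpy as np

def least_squares_train(x, t):
    # Augment every input with a dummy feature x0 = 1
    x_aug = np.c_[np.ones(len(x)), x]
    # Equation (7): W = (X^T X)^(-1) X^T T, via the pseudo-inverse
    return np.linalg.pinv(x_aug) @ t  # column k is the augmented w_k

# Tiny made-up data set: two classes on the real line
x = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # 1-of-K targets
w = least_squares_train(x, t)
print(w.shape)  # (2, 2): one column per class
```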
`x = np.c_[np.ones(x_dim), x]` augments the input vectors with a dummy value $x_0 = 1$. And the transpose of the result makes each row represent a weight vector of equation (2). The entire project can be found at https://github.com/Tony-Tan/ML; please star it.
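To classify a new input we augment it in the same way, evaluate equation (2), and take the largest component. A self-contained sketch (made-up data, hypothetical `predict` helper):

```python
import numpy as np

# Train on a tiny made-up two-class data set via equation (7)
x = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
x_aug = np.c_[np.ones(len(x)), x]
w = np.linalg.pinv(x_aug) @ t

def predict(w, x_new):
    x_tilde = np.r_[1.0, x_new]  # augment with the dummy x0 = 1
    y = w.T @ x_tilde            # equation (2): y = W^T x
    return int(np.argmax(y))     # index of the winning class

print(predict(w, np.array([0.2])))  # 0
print(predict(w, np.array([2.8])))  # 1
```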
I have tested the algorithm on several training sets, and the results are shown in the following figures. However, least squares for classification has some drawbacks:
- Lack of robustness to outliers (Figure 2 illustrates this problem)
- The sum-of-squares error penalizes predictions that are ‘too correct’ (e.g., an input belongs to label ‘1’ but its output is ‘100’)
- Least squares works for regression when we assume the target data has a Gaussian distribution, in which case the least-squares solution maximizes the likelihood function. The distribution of targets in these classification tasks is not Gaussian.
Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006. ↩︎