Least Squares in Classification

Keywords: least squares

Least Squares for Classification[1]

Least squares for linear regression was discussed in ‘Simple Linear Regression’. In this post, we investigate whether this powerful algorithm can also be used for classification.

Recalling the distinction between classification and regression (‘From Linear Regression to Linear Classification’), two points need to be emphasized again:

  1. the targets of regression are continuous, while the targets of classification are discrete.
  2. the output of a classification hypothesis can be the posterior $\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})$ (generative), or simply a class label $\mathcal{C}_k$ (discriminative).

Generative models will be discussed in other posts. Here we focus on discriminative models, which means our hypothesis directly tells us which class the input belongs to.

We want to reuse the least-squares method that was designed and analyzed for linear regression. To extend it to classification, we need to:

  1. modify the type of output
  2. design a discriminative model

Modifying the type of output means converting the class label into a number, and we must assume this number is continuous. Using the 1-of-$K$ labeling scheme, we can build the model with $K$ linear functions:

$$\begin{aligned} y_1(\boldsymbol{x})&=\boldsymbol{w}^T_1\boldsymbol{x}\\ y_2(\boldsymbol{x})&=\boldsymbol{w}^T_2\boldsymbol{x}\\ &\vdots\\ y_K(\boldsymbol{x})&=\boldsymbol{w}^T_K\boldsymbol{x} \end{aligned}\tag{1}$$

where $\boldsymbol{x}=\begin{bmatrix}1\\x_1\\x_2\\\vdots\\x_n\end{bmatrix}$ and $\boldsymbol{w}_i=\begin{bmatrix}w_0\\w_1\\w_2\\\vdots\\w_n\end{bmatrix}$ for $i=1,2,\cdots,K$, and $y_i$ is the $i$th component of the 1-of-$K$ output. Clearly, the output of each $y_i(\boldsymbol{x})$ will not in general be exactly 0 or 1, so we set the largest value to 1 and the others to 0.
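This decision rule can be sketched as follows; the weight vectors here are random placeholders rather than trained values:

```python
import numpy as np

# K discriminant functions y_k(x) = w_k^T x; the predicted class is the
# index of the largest output, which is set to 1 while the others become 0.
rng = np.random.default_rng(0)
K, n = 3, 2                            # number of classes, input dimension
W = rng.standard_normal((n + 1, K))    # column i holds w_i (random placeholder)

x = np.array([1.0, 0.5, -1.2])         # input augmented with a leading 1
y = W.T @ x                            # y_k(x) = w_k^T x for k = 1..K

one_of_K = np.zeros(K)
one_of_K[np.argmax(y)] = 1             # largest output -> 1, others -> 0
```
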

We have already discussed least-squares linear regression for a ‘single-target’ problem. The same idea can be employed for regression with multiple targets, and the $K$ weight vectors $\boldsymbol{w}_i$ can be calculated simultaneously. We can rewrite equation (1) in matrix form:


$$\boldsymbol{y}(\boldsymbol{x})=W^T\boldsymbol{x}\tag{2}$$

where the $i$th column of $W$ is $\boldsymbol{w}_i$.

Then we employ the least-squares method on a sample set:

$$\{(\boldsymbol{x}_1,\boldsymbol{t}_1),(\boldsymbol{x}_2,\boldsymbol{t}_2),\cdots,(\boldsymbol{x}_m,\boldsymbol{t}_m)\}\tag{3}$$

where $\boldsymbol{t}$ is a $K$-dimensional target consisting of $K-1$ 0’s and a single 1. Each dimension $y_i(\boldsymbol{x})$ of the output is the regression result for the corresponding dimension $t_i$ of the target. We build the input matrix $X$ of all $m$ inputs, with the $\boldsymbol{x}^T$ as rows:

$$X=\begin{bmatrix} -&\boldsymbol{x}^T_1&-\\ -&\boldsymbol{x}^T_2&-\\ &\vdots&\\ -&\boldsymbol{x}^T_m&- \end{bmatrix}\tag{4}$$

The sum of square error is:

$$E(W)=\frac{1}{2}\mathrm{Tr}\{(XW-T)^T(XW-T)\}\tag{5}$$

where $T$ is the target matrix whose $i$th row is the target vector $\boldsymbol{t}^T_i$. The trace operation is employed because only the diagonal terms $(W^T\boldsymbol{x}_i-\boldsymbol{t}_i)^T(W^T\boldsymbol{x}_i-\boldsymbol{t}_i)$ for $i=1,2,\cdots,m$ are meaningful; the cross terms $(W^T\boldsymbol{x}_i-\boldsymbol{t}_i)^T(W^T\boldsymbol{x}_j-\boldsymbol{t}_j)$ with $i\neq j$ are not.
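The trace form and the per-sample sum of squared residuals are the same quantity, which a quick numerical check (with random placeholder matrices) confirms:

```python
import numpy as np

# Check that E(W) = 1/2 Tr{(XW-T)^T (XW-T)} equals the per-sample sum
# 1/2 * sum_i ||W^T x_i - t_i||^2, since row i of XW - T is (W^T x_i - t_i)^T.
rng = np.random.default_rng(1)
m, n, K = 5, 3, 2
X = rng.standard_normal((m, n + 1))    # rows are augmented inputs x_i^T
T = rng.standard_normal((m, K))        # rows are targets t_i^T
W = rng.standard_normal((n + 1, K))

R = X @ W - T
E_trace = 0.5 * np.trace(R.T @ R)
E_sum = 0.5 * sum(np.sum((W.T @ X[i] - T[i]) ** 2) for i in range(m))
```
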

To minimize the error in equation (5), we compute its derivative:

$$\begin{aligned} \frac{dE(W)}{dW}&=\frac{d}{dW}\left(\frac{1}{2}\mathrm{Tr}\{(XW-T)^T(XW-T)\}\right)\\ &=\frac{1}{2}\frac{d}{dW}\mathrm{Tr}\{W^TX^TXW-T^TXW-W^TX^TT+T^TT\}\\ &=\frac{1}{2}\frac{d}{dW}(\mathrm{Tr}\{W^TX^TXW\}-\mathrm{Tr}\{T^TXW\}-\mathrm{Tr}\{W^TX^TT\}+\mathrm{Tr}\{T^TT\})\\ &=\frac{1}{2}\frac{d}{dW}(\mathrm{Tr}\{W^TX^TXW\}-2\mathrm{Tr}\{T^TXW\}+\mathrm{Tr}\{T^TT\})\\ &=X^TXW-X^TT \end{aligned}\tag{6}$$

and set it equal to $\boldsymbol{0}$:

$$\begin{aligned} X^TXW-X^TT&=\boldsymbol{0}\\ W&=(X^TX)^{-1}X^TT \end{aligned}\tag{7}$$

where we assume $X^TX$ is invertible.
The matrix $(X^TX)^{-1}X^T$ is also called the pseudo-inverse of $X$ and is usually denoted $X^{\dagger}$.
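When $X^TX$ is invertible, this formula agrees with NumPy's SVD-based `np.linalg.pinv`, which we can verify on a random full-column-rank matrix:

```python
import numpy as np

# For a tall matrix X with full column rank, the pseudo-inverse
# X^dagger = (X^T X)^{-1} X^T matches np.linalg.pinv(X).
rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))       # m > n, full column rank (almost surely)
X_dagger = np.linalg.inv(X.T @ X) @ X.T
```

In practice `np.linalg.pinv` is preferred because it also handles the case where $X^TX$ is singular or ill-conditioned.
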

Code and Result

The code for this algorithm is relatively simple, because we have already programmed linear regression, which has the same form as equation (7).

What we should care about is the formation of the matrices $W$, $X$ and $T$.

We should first convert the target values into 1-of-$K$ form:

```python
import numpy as np


def label_convert(y, method='1-of-K'):
    """Convert a list of class labels into 1-of-K target vectors."""
    if method == '1-of-K':
        label_dict = {}
        number_of_label = 0
        for i in y:
            if i not in label_dict:
                label_dict[i] = number_of_label
                number_of_label += 1
        y_ = np.zeros([len(y), number_of_label])
        for i in range(len(y)):
            y_[i][label_dict[y[i]]] = 1
        return y_, number_of_label
```

What we do is count the total number of labels ($K$) and, for each sample, set the component of the 1-of-$K$ target corresponding to its label to 1 and the other components to 0.
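The encoding logic, restated on a toy label list (the labels here are made up for illustration): classes are numbered in order of first appearance, then each row gets a single 1.

```python
import numpy as np

# Equivalent of the 1-of-K conversion above on hypothetical string labels.
y = ['cat', 'dog', 'cat', 'bird']
classes = list(dict.fromkeys(y))        # unique labels in order of appearance
Y = np.zeros((len(y), len(classes)))
for i, label in enumerate(y):
    Y[i, classes.index(label)] = 1      # one 1 per row, rest 0
```
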

The kernel of the algorithm is:

```python
class LinearClassifier():
    def least_square(self, x, y):
        x = np.array(x)
        x_dim = x.shape[0]
        # augment every input with a dummy feature 1 for the bias term
        x = np.c_[np.ones(x_dim), x]
        # closed-form solution of equation (7): W = (X^T X)^{-1} X^T T
        w = np.linalg.pinv(x.transpose().dot(x)).dot(x.transpose()).dot(y)
        return w.transpose()
```

The line `x = np.c_[np.ones(x_dim), x]` augments the input vector $\boldsymbol{x}$ with a dummy value 1, and the transpose of the result makes each row a weight vector of equation (2). The entire project can be found at https://github.com/Tony-Tan/ML; please star it.
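Putting the pieces together, an end-to-end sketch on a synthetic two-class problem (the data and variable names here are made up for illustration): fit $W$ with the pseudo-inverse formula (7), then classify by taking the argmax of $W^T\boldsymbol{x}$.

```python
import numpy as np

# Two well-separated Gaussian clusters as a toy training set.
rng = np.random.default_rng(3)
c0 = rng.standard_normal((20, 2))              # class 0 around (0, 0)
c1 = rng.standard_normal((20, 2)) + [5, 5]     # class 1 around (5, 5)
X = np.r_[c0, c1]
T = np.zeros((40, 2))
T[:20, 0] = 1                                  # 1-of-K targets
T[20:, 1] = 1

X_aug = np.c_[np.ones(len(X)), X]              # prepend the dummy 1
W = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ T   # equation (7)

pred = np.argmax(X_aug @ W, axis=1)            # largest output wins
labels = np.r_[np.zeros(20), np.ones(20)]
accuracy = np.mean(pred == labels)
```
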

I have tested the algorithm on several training sets, and the results are shown in the following figures:

Problems of Least Squares

  1. Lack of robustness to outliers (Figure 2 illustrates this problem)
  2. The sum-of-squares error penalizes predictions that are ‘too correct’ (e.g., an input belongs to label ‘1’ but its output is ‘100’)
  3. Least squares works for regression when we assume the target data has a Gaussian distribution, in which case minimizing the sum-of-squares error maximizes the likelihood. The distribution of targets in classification tasks is not Gaussian.
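Problem 1 can be demonstrated with a small sketch (synthetic data and helper names invented for illustration): extra class-1 points that lie far on the *correct* side of the boundary still drag the least-squares solution toward them.

```python
import numpy as np

def fit_ls(X, T):
    # closed-form least-squares weights from equation (7)
    X_aug = np.c_[np.ones(len(X)), X]
    return np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ T

def accuracy(W, X, labels):
    X_aug = np.c_[np.ones(len(X)), X]
    return np.mean(np.argmax(X_aug @ W, axis=1) == labels)

rng = np.random.default_rng(4)
c0 = rng.standard_normal((30, 2))              # class 0 around (0, 0)
c1 = rng.standard_normal((30, 2)) + [3, 3]     # class 1 around (3, 3)
X = np.r_[c0, c1]
T = np.zeros((60, 2))
T[:30, 0] = 1
T[30:, 1] = 1

W_clean = fit_ls(X, T)

# extra class-1 points far on the *correct* side of the boundary
outliers = rng.standard_normal((10, 2)) + [12, 12]
W_out = fit_ls(np.r_[X, outliers], np.r_[T, np.tile([0.0, 1.0], (10, 1))])

labels = np.r_[np.zeros(30), np.ones(30)]
acc_clean = accuracy(W_clean, X, labels)
acc_out = accuracy(W_out, X, labels)   # the dragged boundary may misclassify
```

The squared error of the outliers dominates the fit even though they are unambiguously classified, which is exactly the pathology a logistic model avoids.
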


  1. Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006. ↩︎