From Linear Regression to Linear Classification

Keywords: linear classification

Recall Linear Regression[1]

In the posts ‘Introduction to Linear Regression’, ‘Simple Linear Regression’ and ‘Polynomial Regression and Features-Extension of Linear Regression’, we had discussed the regression task. The goal of regression is to find out a function or hypothesis that given an input x\boldsymbol{x}, the hypothesis can make a prediction y^\hat{y} which should be as close to the target yy as possible. And both the target yy and prediction y^\hat{y} are contiuous. They have the properties of numbers:

Considering 3 inputs x1\boldsymbol{x}_1, x2\boldsymbol{x}_2 and x3\boldsymbol{x}_3 and their coresponding targets y1=0y_1=0, y2=1y_2=1 and y3=2y_3=2. Then a good predictor should give the predictions y^1\hat{y}_1, y^2\hat{y}_2 and y^3\hat{y}_3 where the distance between y^1\hat{y}_1 and y^2\hat{y}_2 is larger than the one between y^1\hat{y}_1 and y^3\hat{y}_3

Some properties of linear regression or other regression tasks we should pay attention:

  1. The goal of regression is to produce a hypothesis that can give a prediction as close to the target as possible
  2. The output of the hypothesis and target are continuous numbers and have numerical meanings, like distance, order and so on.

General Classification

On the other side, we met more classification tasks in our life than regression. Such as in the supermarket we can tell the apple and the orange apart easily. And we can even verify this apple is tasty or not.

Then the goal of classification is clear:

Assign input x\boldsymbol{x} to a certain class of KK classes. And x\boldsymbol{x} belongs to one and only one class.

The input x\boldsymbol{x}, like the input of regression, can be a feature or basis function of feature and can be continuous or discrete. However, its output is different from that of regression. Let’s go back to the example that we can tell apple, orange, and pineapple apart. The difference between apple and orange and the difference between apple and pineapple can not be compared, for the distance(it is the mathematical name of difference) itself had no means.

In the mathematical models, apple and orange can not be calculated. So a usual step in classification task is mapping the target or labels of an example into a number, like 11 for the apple and 22 for the orange.

A binary code scheme is another way to code targets. To a two classes mission, the numerical labels can be:

C1=0 and C2=1(1)\mathcal{C}_1=0 \text{ and }\mathcal{C}_2=1\tag{1}

It’s equal to:

C1=1 and C2=0(2)\mathcal{C}_1=1 \text{ and }\mathcal{C}_2=0\tag{2}

and to a KK classes target, the binary code scheme is:

C1={1,0,,0}C2={0,1,,0}CK={0,0,,1}(3)\begin{aligned} \mathcal{C}_1 &= \{1,0,\cdots,0\}\\ \mathcal{C}_2 &= \{0,1,\cdots,0\}\\ \vdots & \\ \mathcal{C}_K &= \{0,0,\cdots,1\}\\ \end{aligned}\tag{3}

The nn-dimensional input xRn\boldsymbol{x}\in \mathbb{R}^n and Rn\mathbb{R}^n is called the input space. In the classification task, the input points can be separated by the targets, and these parts of space are called decision regions and the boundaries between decision regions are called decision boundaries or decision surfaces. When the decision boundary is linear, the task is called linear classification.

There are roughly two kinds of procedure of classification:

  1. Discriminant Function: assign input x\boldsymbol{x} to a certain class directly.
  2. We infer P(Ckx)\mathbb{P}(\mathcal{C}_k|\boldsymbol{x}) firstly and then make a decision basing on the posterior probability.
    1. Inference of P(Ckx)\mathbb{P}(\mathcal{C}_k|\boldsymbol{x}) can be calculated directly
    2. P(Ckx)\mathbb{P}(\mathcal{C}_k|\boldsymbol{x}) can also be calculated by Bayesian Theorem P(Ckx)=P(xCk)P(Ck)P(x)=P(xCk)P(Ck)kP(xCk)P(Ck)\mathbb{P}(\mathcal{C}_k|\boldsymbol{x})=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\mathbb{P}(\boldsymbol{x})}=\frac{\mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}{\sum_k \mathbb{P}(\boldsymbol{x}|\mathcal{C}_k)\mathbb{P}(\mathcal{C}_k)}

They are discriminate model and generative model, respectively.

Linear Classification

In the regression problem, the output of the linear function:


is an approximation of target. But in the classification task, we want the output is the class to which the input x\boldsymbol{x} belongs. However, the output of the linear function is always continuous. Then we wish this value can represent the probability of a given class but not the class label, say P(Cix)\mathbb{P}({\mathcal{C}_i|\boldsymbol{x}}). To achieve this, function f()f(\cdot) which is called ‘action function’ in machine learning and ‘link function’ in statistical learning is introduced. For example, we can choose a threshold function as the active function:


where f()f(\cdot) is the threshold function. In this case, the boundary is f(wTx+b)= constant f(\boldsymbol{w}^T\boldsymbol{x}+b) = \text{ constant }, and so wTx+b= constant \boldsymbol{w}^T\boldsymbol{x}+b = \text{ constant } is a line. Technically, f(wTx+b)f(\boldsymbol{w}^T\boldsymbol{x}+b) is no more linear. It is a generalized linear model. By the way, the input x\boldsymbol{x} can be replaced by a basis function ϕ(x)\phi(\boldsymbol{x}) as in the polynomial regression.


  1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006. ↩︎