Logistic Regression

Keywords: logistic regression, cross-entropy

Idea of logistic regression[1]

The logistic sigmoid function (logistic function for short) was introduced in the post ‘An Introduction to Probabilistic Generative Models for Linear Classification’. It has an elegant form:

$$\delta(a)=\frac{1}{1+e^{-a}}\tag{1}$$

and when $a=0$, $\delta(a)=\frac{1}{2}$, which is the midpoint of the range of the logistic function. This strongly suggests that we can set $a$ equal to some function $y(\boldsymbol{x})$, and then

$$a=y(\boldsymbol{x})=0\tag{2}$$

becomes a decision boundary. Here the logistic function plays the same role as the threshold function described in the post ‘From Linear Regression to Linear Classification’.

Logistic Regression for Linear Classification

The simplest case is a 1-dimensional input, but its decision boundary is a degenerate line, namely a point. So we consider a slightly more complex case: a 2-dimensional input vector $[x_1\;x_2]^T$, for which the function $y(\boldsymbol{x})$ is:

$$y(\boldsymbol{x})=w_0+w_1x_1+w_2x_2=\boldsymbol{w}^T\boldsymbol{x}=\begin{bmatrix}w_0&w_1&w_2\end{bmatrix}\begin{bmatrix}1\\ x_1\\ x_2\end{bmatrix}\tag{3}$$

Substituting this into equation (1), we get our linear logistic regression function:

$$y(\boldsymbol{x})=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{4}$$

The output of function (4) is 1-dimensional and its range is $0$ to $1$, so it can be used to represent the probability that the input belongs to $\mathcal{C}_1$, whose label is $1$, or $\mathcal{C}_0$, whose label is $0$, in the training set. In this form we can only handle a two-class task, because we have only one decision boundary.
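To make this concrete, here is a minimal sketch (the weight vector and input point below are made up for illustration) of evaluating equation (4) and reading off the class from the decision boundary $\boldsymbol{w}^T\boldsymbol{x}=0$:

```python
import numpy as np

def logistic_sigmoid(a):
    # equation (1)
    return 1.0 / (1.0 + np.exp(-a))

# a made-up weight vector [w0, w1, w2], for illustration only
w = np.array([-1.0, 2.0, 0.5])

# a 2-dimensional input point, augmented with a leading 1 as in equation (3)
x = np.array([1.0, 0.9, 0.6])

p_c1 = logistic_sigmoid(w.dot(x))   # equation (4): P(C1 | x)
label = 1 if p_c1 >= 0.5 else 0     # p_c1 >= 1/2 exactly when w^T x >= 0
print(p_c1, label)
```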

Estimating the Parameters in Logistic Regression

Although logistic regression is called regression, it acts as a classifier. Our mission now is to estimate the parameters in equation (4).

Recall that the output of equation (4) is $\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)$, where $M$ is the model we selected, and a model can often be represented by its parameters. Our mission, should we choose to accept it, is to build the probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{x},t)$, where $\boldsymbol{x}$ is the input vector and $t\in\{0,1\}$ is the corresponding label; the conditioning on $\boldsymbol{x}$ is usually omitted. Since $t$ indicates one of $\mathcal{C}_1$ or $\mathcal{C}_0$, the Bayesian relation for $\mathbb{P}(\boldsymbol{w}|t)$ is:

$$\mathbb{P}(\boldsymbol{w}|t)=\frac{\mathbb{P}(t|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{w})\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}=\frac{\mathbb{P}(\mathcal{C}_i|\boldsymbol{x},M)\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(t)}\tag{5}$$

Then maximum likelihood is employed to estimate the parameters, and the likelihood is just:

$$\mathbb{P}(\mathcal{C}_1|\boldsymbol{x},M)=\delta(\boldsymbol{w}^T\boldsymbol{x})=\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}\tag{6}$$

Once we have the training set:

$$\{\boldsymbol{x}_1,t_1\},\{\boldsymbol{x}_2,t_2\},\cdots,\{\boldsymbol{x}_N,t_N\}\tag{7}$$

The likelihood becomes:

$$\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\tag{8}$$

In the second part, when $\boldsymbol{x}_i$ belongs to $\mathcal{C}_0$ its label is $0$ and the output for it should approach $0$, so minimising $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$ is equivalent to maximising $1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$. Because equation (8) is not convenient to optimise, we can use the property $t_i\in\{0,1\}$ to write:

$$
\begin{aligned}
&\prod_{t_i\in \mathcal{C}_1}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\prod_{t_i\in \mathcal{C}_0}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\\
=&\prod_{i=1}^N\left(\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)^{t_i}\left(1-\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)^{1-t_i}
\end{aligned}\tag{9}
$$
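As a quick numerical sanity check (a sketch with made-up inputs, labels, and weights), the two forms in equation (9) agree, since the labels $t_i\in\{0,1\}$ simply switch each factor on or off:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# made-up augmented inputs, labels and weights, for illustration only
X = np.array([[1.0, 0.5, -1.2], [1.0, -0.3, 0.7], [1.0, 1.5, 0.2]])
t = np.array([1, 0, 1])
w = np.array([0.1, -0.4, 0.8])

y = sigmoid(X.dot(w))

# equation (8): separate products over the two classes
lik_8 = np.prod(y[t == 1]) * np.prod(1 - y[t == 0])

# equation (9): one product using the labels as exponents
lik_9 = np.prod(y**t * (1 - y)**(1 - t))

print(np.isclose(lik_8, lik_9))  # True
```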

From now on, we face an optimisation problem. Maximising equation (9) is equivalent to minimising its negative logarithm (we use $y_i$ to represent $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$):

$$
\begin{aligned}
E&=-\mathrm{ln}\;\prod^N_{i=1}y_i^{t_i}(1-y_i)^{1-t_i}\\
&=-\sum^N_{i=1}\left(t_i\,\mathrm{ln}\,y_i+(1-t_i)\,\mathrm{ln}(1-y_i)\right)
\end{aligned}\tag{10}
$$

Equation (10) is the cross-entropy, a very important concept in information theory; in machine learning it is known as the cross-entropy error and is a very useful loss function.
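In code, the cross-entropy error of equation (10) is only a few lines; a minimal sketch, assuming `y` holds the predicted probabilities and `t` the $0/1$ labels:

```python
import numpy as np

def cross_entropy_error(t, y):
    # equation (10): E = -sum( t*ln(y) + (1-t)*ln(1-y) )
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(cross_entropy_error([1, 0, 1], [0.9, 0.2, 0.7]))
```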

Since there is no closed-form solution, the steepest descent algorithm is employed, and the first thing we need to calculate is the derivative of equation (10). Because $\boldsymbol{w}$ sits inside the function $y_i(\boldsymbol{x})$, the chain rule is necessary:

$$
\begin{aligned}
\frac{dE}{dw}&=-\frac{d}{dw}\sum^N_{i=1}\left(t_i\,\mathrm{ln}\,y_i+(1-t_i)\,\mathrm{ln}(1-y_i)\right)\\
&=-\sum^N_{i=1}\left(\frac{d}{dw}t_i\,\mathrm{ln}\,y_i+\frac{d}{dw}(1-t_i)\,\mathrm{ln}(1-y_i)\right)\\
&=-\sum^N_{i=1}\left(t_i\frac{y_i'}{y_i}+(1-t_i)\frac{-y'_i}{1-y_i}\right)
\end{aligned}\tag{11}
$$

and since

$$
\begin{aligned}
y_i'&=\frac{d}{dw}\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\\
&=\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2}
\end{aligned}\tag{12}
$$

Substituting equation (12) into equation (11), we have:

$$
\begin{aligned}
\frac{dE}{dw}&=-\sum_{i=1}^N t_i\frac{\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2}}{\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}}-(1-t_i)\frac{\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{(1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i})^2}}{\frac{e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}}\\
&=-\sum_{i=1}^N t_i\frac{\boldsymbol{x}_ie^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}-(1-t_i)\frac{\boldsymbol{x}_i}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\\
&=-\sum_{i=1}^N\left(t_i\frac{e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}-(1-t_i)\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}\right)\boldsymbol{x}_i\\
&=-\sum_{i=1}^N\big(t_i(1-y_i)-(1-t_i)y_i\big)\boldsymbol{x}_i\\
&=-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i
\end{aligned}\tag{13}
$$
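A quick way to gain confidence in the algebra is to compare equation (13) against a finite-difference approximation of equation (10); the sketch below (with made-up data) does exactly that:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, t):
    # equation (10)
    y = sigmoid(X.dot(w))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def analytic_gradient(w, X, t):
    # equation (13): dE/dw = -sum_i (t_i - y_i) x_i
    y = sigmoid(X.dot(w))
    return -X.T.dot(t - y)

# made-up augmented inputs, labels and weights for the check
X = np.array([[1.0, 0.2, -0.5], [1.0, -1.1, 0.4], [1.0, 0.7, 0.9]])
t = np.array([1.0, 0.0, 1.0])
w = np.array([0.3, -0.2, 0.5])

# central finite differences along each coordinate of w
eps = 1e-6
numeric = np.array([
    (cross_entropy(w + eps * e, X, t) - cross_entropy(w - eps * e, X, t)) / (2 * eps)
    for e in np.eye(len(w))
])
print(np.allclose(numeric, analytic_gradient(w, X, t)))  # True
```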

Then we can update $\boldsymbol{w}$:

$$\boldsymbol{w} = \boldsymbol{w} - \mathrm{learning\; rate} \times \left(-\frac{1}{N}\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i\right)\tag{14}$$

Code for logistic regression

```python
import numpy as np


class LogisticRegression():

    def logistic_sigmoid(self, a):
        # equation (1)
        return 1. / (1.0 + np.exp(-a))

    def fit(self, x, y, e_threshold, lr=1):
        # normalise the raw input (assumed to lie roughly in [0, 640]) into [-1, 1]
        x = np.array(x) / 320. - 1
        # augment the input with a leading 1 for the bias w_0
        x_dim = x.shape[0]
        x = np.c_[np.ones(x_dim), x]
        # random initial parameters in [0.01, 0.99]
        w = np.random.randint(1, 100, x.shape[1]) / 100.
        number_of_points = np.size(y)
        for dummy in range(1000):
            y_output = self.logistic_sigmoid(w.dot(x.transpose()))
            # gradient of the cross-entropy error, equation (13), averaged over N
            e_gradient = np.zeros(x.shape[1])
            for i in range(number_of_points):
                e_gradient += (y_output[i] - y[i]) * x[i]
            e_gradient = e_gradient / number_of_points
            # update the parameters by steepest descent, equation (14)
            w += -e_gradient * lr
            # average cross-entropy error, equation (10)
            e = 0
            for i in range(number_of_points):
                e += -(y[i] * np.log(y_output[i]) + (1 - y[i]) * np.log(1 - y_output[i]))
            e /= number_of_points
            if e <= e_threshold:
                break
        return w
```

The entire project can be found at https://github.com/Tony-Tan/ML, and please star me.

There are two hyperparameters: the learning rate and the stopping threshold. When the cross-entropy error falls below the threshold, the algorithm stops.
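For reference, a hypothetical usage of the class above on synthetic two-class data; the cluster centres and the $[0, 640]$ coordinate range are assumptions chosen to match the `/320. - 1` normalisation inside `fit`:

```python
import numpy as np

# two synthetic clusters in a [0, 640] x [0, 640] coordinate range (made up)
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[200, 200], scale=40, size=(50, 2))
class_1 = rng.normal(loc=[450, 450], scale=40, size=(50, 2))

x = np.vstack([class_0, class_1])
y = np.array([0] * 50 + [1] * 50)

# assumes the LogisticRegression class defined above is available
model = LogisticRegression()
w = model.fit(x, y, e_threshold=0.1, lr=1)
print(w)  # learned [w0, w1, w2]
```

With well-separated clusters the cross-entropy usually drops below the threshold long before the 1000-iteration cap; otherwise `fit` simply returns the current weights when the cap is reached.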

Experiment 1

Starting with a learning rate of 1; different learning rates lead to different convergence rates:

Experiment 2

Starting with a learning rate of 1; different learning rates lead to different convergence rates:

Experiment 3

Starting with a learning rate of 1; different learning rates lead to different convergence rates:

Several Traps in Logistic Regression

  1. Input values should be normalised and centred at 0.
  2. The learning rate should be chosen according to equation (14), not according to $-\sum_{i=1}^N(t_i-y_i)\boldsymbol{x}_i$, because the raw sum carries the uncertain coefficient $N$ (it scales with the size of the training set).
  3. The parameter vector $\boldsymbol{w}$ only identifies a direction, so its magnitude can be arbitrarily large. For a large $\boldsymbol{w}$, $y_i(\boldsymbol{x})$ is very close to 1 or 0 but can never reach 1 or 0. So there is no optimal point, equation (13) can never be $0$, and the algorithm can never stop by itself.
  4. Considering $\frac{1}{1+e^{-\boldsymbol{w}^T\boldsymbol{x}}}$: when the magnitude of $\boldsymbol{w}$ grows, we can write $\boldsymbol{w}$ as a combination of a magnitude $M$ and a direction vector $\boldsymbol{w}_d$, giving $\frac{1}{1+e^{-M(\boldsymbol{w}_d^T\boldsymbol{x})}}$. The shape of $\frac{1}{1+e^{-Mx}}$ varies with $M$: as $M$ grows, the logistic function behaves more and more like the Heaviside step function (see the sketch after this list).
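A minimal sketch (using matplotlib, with arbitrarily chosen values of $M$) that reproduces the effect described in item 4:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)
for M in [0.5, 1, 2, 10]:  # arbitrarily chosen magnitudes
    plt.plot(x, 1.0 / (1.0 + np.exp(-M * x)), label=f'M = {M}')

plt.axhline(0.5, color='grey', linewidth=0.5)
plt.legend()
plt.title('Logistic function approaching the Heaviside step as M grows')
plt.show()
```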

References


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.