Fisher Linear Discriminant (LDA)

Keywords: Fisher linear discriminant, LDA

Idea of the Fisher linear discriminant[1]

The least-squares method can handle only a small set of classification tasks, because it was built for regression. What we want is a method designed specifically for linear classification, and this brings us to the famous Fisher linear discriminant. The method is discriminative, for it directly gives the class to which the input \boldsymbol{x} belongs. Assume the linear function

y=\boldsymbol{w}^T\boldsymbol{x}+w_0\tag{1}

is employed as before, together with the threshold function f(y)=\begin{cases}1 &\text{ if } y\leq 0\\0 &\text{ otherwise }\end{cases}. Then if y\leq 0, or equivalently \boldsymbol{w}^T\boldsymbol{x}\leq -w_0, \boldsymbol{x} belongs to \mathcal{C}_1; otherwise it belongs to \mathcal{C}_2.
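For concreteness, the decision rule can be sketched in a few lines of Python (`predict` and its signature are hypothetical names chosen for this illustration, not from the text):

```python
import numpy as np

def predict(w, w0, x):
    """Assign x to class C1 when y = w^T x + w0 <= 0, else to C2.

    A minimal sketch of the thresholding step; w and w0 are assumed
    to have been learned already.
    """
    y = np.dot(w, x) + w0
    return 1 if y <= 0 else 2
```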

We decide which class the input \boldsymbol{x} belongs to from y, a 1-dimensional variable. However, our input vector has dimension m, which is usually greater than 1. Reducing the dimension of the data in this way always comes with information loss, yet we still want the data to remain separable in the lower-dimensional space. Our mission is therefore to find a \boldsymbol{w} that maximizes the separation between the classes in 1-dimensional space.

The first strategy that comes to mind is to maximize the distance between the projections of the class centers onto a certain line. The first step, then, is to compute the center of class \mathcal{C}_k, which contains N_k points:

\boldsymbol{m}_k=\frac{1}{N_k}\sum_{\boldsymbol{x}_i\in \mathcal{C}_k}\boldsymbol{x}_i\tag{2}
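As a quick illustration, the mean in equation (2) is just a per-class average; a minimal NumPy sketch with made-up toy data and 0/1 labels:

```python
import numpy as np

# Toy data: rows are samples; labels 0/1 mark the two classes.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [8.0, 7.0]])
t = np.array([0, 0, 1, 1])

# Equation (2): per-class mean vectors.
m1 = X[t == 0].mean(axis=0)
m2 = X[t == 1].mean(axis=0)
```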

So the distance between the projections m_1, m_2 of the two class centers \boldsymbol{m}_1, \boldsymbol{m}_2 is:

m_2-m_1=\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1)\tag{3}

where

m_k = \boldsymbol{w}^T\boldsymbol{m}_k\tag{4}

The \boldsymbol{w} here serves as a direction vector, so its magnitude can be arbitrary; luckily, the magnitude is not what we are interested in. To keep m_k from growing to infinity, we constrain \boldsymbol{w} to unit length:

\sum_i w_i^2=1\tag{5}
When \boldsymbol{w} has the same direction as \boldsymbol{m}_2-\boldsymbol{m}_1, equation (3) attains its maximum value.
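A quick numerical check of this claim, with hypothetical toy centers: random unit vectors never beat the unit vector along the line joining the two centers.

```python
import numpy as np

# Hypothetical class centers.
m1 = np.array([0.0, 0.0])
m2 = np.array([3.0, 4.0])

# Try many random unit-length directions and record the best separation.
rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)          # enforce the constraint ||w|| = 1
    best = max(best, w @ (m2 - m1))

# Unit vector along m2 - m1; its separation equals ||m2 - m1||.
w_star = (m2 - m1) / np.linalg.norm(m2 - m1)
sep_star = w_star @ (m2 - m1)
```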

Then the result looks like:

The blue star and circle are the centers of the red stars and the green circles, respectively, and the blue solid line is the line we want, on which the projections of the center points are farthest apart.

On inspection, however, this line does not give the optimal solution: although the projections of the centers are farthest apart on this line, the projections of the sample points scatter over a relatively large area.

This phenomenon also exists in daily life. For example, when two seats are close together, the people sitting on them do not have enough space (the original condition). So we move the seats a little farther from each other (make the projections of the centers farther apart). But this time, two fat guys come to sit on them (the projections have a large variance), and the space is still not enough.

In our problem, the projections of the points of a class are the fat guys. We need to make the centers (the projections of the centers) far from each other and, at the same time, make the points of each class (the projections of the points within a class) slender, which means a smaller variance.

The variance of the projected points of class k can be calculated by:

s^2_k=\sum_{n\in \mathcal{C}_k}(y_n - m_k)^2\tag{6}

where y_n=\boldsymbol{w}^T\boldsymbol{x}_n; this quantity is also called the within-class variance.
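Equation (6) in code, with a hypothetical direction \boldsymbol{w} and toy samples for one class:

```python
import numpy as np

# Hypothetical projection direction and toy samples for one class.
w = np.array([0.6, 0.8])
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

y = X1 @ w                 # projected samples y_n = w^T x_n
m = y.mean()               # projected class mean m_k
s2 = np.sum((y - m) ** 2)  # equation (6): within-class variance
```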

To make the seats comfortable for the people who sit on them, we need to place the seats as far from each other as possible (maximize (m_2-m_1)^2) and only let thin people sit on them (minimize the sum of the within-class variances). The Fisher criterion satisfies these requirements:

J(\boldsymbol{w})=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}\tag{7}
Writing J(\boldsymbol{w}) out in detail:

\begin{aligned} J(\boldsymbol{w})&=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}\\ &=\frac{(\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1))^T(\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1))}{\sum_{n\in \mathcal{C}_1}(y_n - m_1)^2+\sum_{n\in \mathcal{C}_2}(y_n - m_2)^2}\\ &=\frac{(\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1))^T(\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1))} {\sum_{n\in \mathcal{C}_1}(\boldsymbol{w}^T\boldsymbol{x}_n - \boldsymbol{w}^T\boldsymbol{m}_1)^2+\sum_{n\in \mathcal{C}_2}(\boldsymbol{w}^T\boldsymbol{x}_n - \boldsymbol{w}^T\boldsymbol{m}_2)^2}\\ &=\frac{\boldsymbol{w}^T(\boldsymbol{m}_2-\boldsymbol{m}_1)(\boldsymbol{m}_2-\boldsymbol{m}_1)^T\boldsymbol{w}} {\boldsymbol{w}^T\left(\sum_{n\in \mathcal{C}_1}(\boldsymbol{x}_n - \boldsymbol{m}_1)(\boldsymbol{x}_n - \boldsymbol{m}_1)^T+\sum_{n\in \mathcal{C}_2}(\boldsymbol{x}_n - \boldsymbol{m}_2)(\boldsymbol{x}_n - \boldsymbol{m}_2)^T\right)\boldsymbol{w}} \end{aligned}\tag{8}

And we can set:

\begin{aligned} S_B &= (\boldsymbol{m}_2-\boldsymbol{m}_1)(\boldsymbol{m}_2-\boldsymbol{m}_1)^T\\ S_W &= \sum_{n\in \mathcal{C}_1}(\boldsymbol{x}_n - \boldsymbol{m}_1)(\boldsymbol{x}_n - \boldsymbol{m}_1)^T+\sum_{n\in \mathcal{C}_2}(\boldsymbol{x}_n - \boldsymbol{m}_2)(\boldsymbol{x}_n - \boldsymbol{m}_2)^T \end{aligned}\tag{9}

where S_B is the Between-class covariance matrix and S_W is the Within-class covariance matrix. Equation (8) then becomes:

J(\boldsymbol{w})=\frac{\boldsymbol{w}^TS_B\boldsymbol{w}}{\boldsymbol{w}^TS_W\boldsymbol{w}}\tag{10}
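Equation (10) is a ratio of two quadratic forms (a generalized Rayleigh quotient). A small sketch, with made-up toy matrices, also shows that rescaling \boldsymbol{w} leaves J unchanged, which is why the magnitude of \boldsymbol{w} does not matter:

```python
import numpy as np

def fisher_criterion(w, S_B, S_W):
    # Equation (10): between-class over within-class scatter along w.
    w = np.asarray(w, dtype=float)
    return (w @ S_B @ w) / (w @ S_W @ w)

# Hypothetical toy matrices: S_B built from a mean difference d, S_W = identity.
d = np.array([1.0, 1.0])
S_B = np.outer(d, d)
S_W = np.eye(2)
```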

To maximize J(\boldsymbol{w}), we first differentiate equation (10) with respect to \boldsymbol{w}:

\begin{aligned} \frac{\partial }{\partial \boldsymbol{w}}J(\boldsymbol{w})&=\frac{\partial }{\partial \boldsymbol{w}}\frac{\boldsymbol{w}^TS_B\boldsymbol{w}}{\boldsymbol{w}^TS_W\boldsymbol{w}}\\ &=\frac{(S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})-(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}}{(\boldsymbol{w}^TS_W\boldsymbol{w})^2} \end{aligned}\tag{11}

and set it to zero:

\frac{(S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})-(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}}{(\boldsymbol{w}^TS_W\boldsymbol{w})^2}=0\tag{12}

and then, using the symmetry of S_B and S_W (so S_B^T=S_B and S_W^T=S_W):

\begin{aligned} (S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})&=(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}\\ (\boldsymbol{w}^TS_W\boldsymbol{w})S_B\boldsymbol{w}&=(\boldsymbol{w}^TS_B\boldsymbol{w})S_W\boldsymbol{w} \end{aligned}\tag{13}

Because (\boldsymbol{w}^TS_W\boldsymbol{w}) and (\boldsymbol{w}^TS_B\boldsymbol{w}) are scalars, and because, from equation (9), multiplying S_B by \boldsymbol{w} gives

S_B \boldsymbol{w}= (\boldsymbol{m}_2-\boldsymbol{m}_1)((\boldsymbol{m}_2-\boldsymbol{m}_1)^T\boldsymbol{w})\tag{14}

where (\boldsymbol{m}_2-\boldsymbol{m}_1)^T\boldsymbol{w} is a scalar, so that S_B\boldsymbol{w} has the same direction as \boldsymbol{m}_2-\boldsymbol{m}_1,

equation (13) can be written as:

\begin{aligned} \boldsymbol{w}&\propto S^{-1}_WS_B\boldsymbol{w}\\ \boldsymbol{w}&\propto S^{-1}_W(\boldsymbol{m}_2-\boldsymbol{m}_1) \end{aligned}\tag{15}
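Equation (15) in code: rather than forming S_W^{-1} explicitly, one can solve the linear system directly. The scatter matrix and means below are hypothetical toy values:

```python
import numpy as np

# Hypothetical within-class scatter matrix and class means.
S_W = np.array([[2.0, 0.5], [0.5, 1.0]])
m1 = np.array([1.0, 2.0])
m2 = np.array([4.0, 3.0])

# Equation (15): w proportional to S_W^{-1}(m2 - m1);
# solving the system is cheaper and more stable than inverting.
w = np.linalg.solve(S_W, m2 - m1)
```

Equivalently, \boldsymbol{w} is the leading eigenvector of S^{-1}_WS_B, which is how the result is often stated for the multi-class case.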

So we now have the parameter vector \boldsymbol{w} that maximizes the Fisher criterion.


The code for this procedure is relatively simple:

```python
import numpy as np


class LinearClassifier():

    def fisher(self, x, y):
        """Return w proportional to S_W^{-1}(m_2 - m_1), equation (15)."""
        x = np.array(x)
        x_dim = x.shape[1]
        # Class means, equation (2)
        m_1 = np.zeros(x_dim)
        m_1_size = 0
        m_2 = np.zeros(x_dim)
        m_2_size = 0
        for i in range(len(y)):
            if y[i] == 0:
                m_1 = m_1 + x[i]
                m_1_size += 1
            else:
                m_2 = m_2 + x[i]
                m_2_size += 1
        if m_1_size != 0 and m_2_size != 0:
            m_1 = (m_1 / m_1_size).reshape(-1, 1)
            m_2 = (m_2 / m_2_size).reshape(-1, 1)
        # Within-class scatter matrix S_W, equation (9)
        s_c_1 = np.zeros([x_dim, x_dim])
        s_c_2 = np.zeros([x_dim, x_dim])
        for i in range(len(y)):
            x_i = x[i].reshape(-1, 1)
            if y[i] == 0:
                s_c_1 += (x_i - m_1).dot((x_i - m_1).transpose())
            else:
                s_c_2 += (x_i - m_2).dot((x_i - m_2).transpose())
        s_w = s_c_1 + s_c_2

        return np.linalg.inv(s_w).dot(m_2 - m_1)
```

The entire project can be found in the accompanying repository; please star it if it helps you.

The results of the code (where the line is the one onto which the points are projected):


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. ↩︎