Keywords: Fisher linear discriminant, LDA

## Idea of Fisher linear discriminant[1]

‘Least-square method’ in classification can only deal with a small set of tasks. That is because it was built for the regression task. However, we want a method to solve linear classification especially. Then we come to the famous Fisher linear discriminant. This method is also discriminative for it gives directly the class which the input $\boldsymbol{x}$ belongs to. Assuming that the linear function

$y=\boldsymbol{w}^T\boldsymbol{x}+w_0\tag{1}$

is employed as before. And the threshold function $f(\boldsymbol{y})=\begin{cases}1 &\text{ if } y\leq 0\\0 &\text{ otherwise }\end{cases}$ is used as well. Then if $y<0$ or equivalenttly $\boldsymbol{w}^T\boldsymbol{x}\leq -w_0$ , $\boldsymbol{x}$ belongs to $\mathcal{C}_1$, or it belongs to $\mathcal{C}_2$.

We decide the input $\boldsymbol{x}$ belongs to which classes by $y$ which is a 1-dimensional variable. However, our input vector dimension is $m$, which is usually greater than 2. Although the dimensions of data are reduced during this process which always comes with information loss, we still want the data in the lower dimension is possible to be separated. Under this situation, our mission is converted to find a $\boldsymbol{w}$ that can maximize the separation between different classes in 1-dimensional space.

The first strategy comes to us is to maximize the distance between the projections of the centers of different classes on a certain line. Then the first step is to get the center of a class $\mathcal{C}_k$ whose size is $N_k$ and center is:

$\boldsymbol{m}_k=\frac{1}{N_k}\sum_{x_i\in \mathcal{C}_k}\boldsymbol{x}_i\tag{2}$

So the distance between the projections $m_1$, $m_2$ of centers of the two classes, $\boldsymbol{m}_1$, $\boldsymbol{m}_2$ is:

$m_1-m_2=\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2)\tag{3}$

where

$m_k = \boldsymbol{w}^T\boldsymbol{m}_k\tag{4}$

And for the $\boldsymbol{w}$ here is referred to as the direction vector, so its margin can be arbitrary. Luckily, the margin is not an element for our research. So, to avoid $m_k$ going to infinity, we can set the length of $\boldsymbol{w}$ to be 1:

$||\boldsymbol{w}||=1\tag{5}$

When $\boldsymbol{w}$ has the same direction with $\boldsymbol{m}_1-\boldsymbol{m}_2$, equation(3) get its maximum value.

Then the result looks like:

The blue star and circle are the center of red stars and green circles, respectively. and the blue solid line is the line we want on which the projections of center points have the longest distance.

With our observation, this line does not give the optimum solution. Because, although the projections of centers have the longest distance on this line, the projections of all sample points scatter into a relatively large area.

This phenomenon exists in our daily life. For example, when two seats are close, the people sitting on them do not have enough space(this is the original condition). Then we move the seats and make them a little far from each other(make the projections of centers far from each other). But this time, two fat guys come to sit on them(the projection has a big variance), then space is still not enough.

In our problem, the projections of the points of a class are the fat guys. We need to make the centers(projection of centers) far away from each other and make the points(projections of points in a class) slender that means a smaller variance at the same time.

The variance of the projected points of class $k$ can be calculated by:

$s^2_k=\sum_{n\in \mathcal{C}_k}(y_n - m_k)^2\tag{6}$

and it is also called within-class variance.

To make the seat comfortable for people who sit on them, we need to make the seat as far as possible to each other(maximize $(m_2-m_1)^2$) and only let thin people sit(minimize the sum of within-class variance). Fisher criterion satisfies this requirements:

$J(\boldsymbol{w})=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}\tag{7}$

And $J(\boldsymbol{w})$ in details is:

\begin{aligned} J(\boldsymbol{w})&=\frac{(m_2-m_1)^2}{s_1^2+s_2^2}\\ &=\frac{(\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2))^T(\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2))}{\sum_{n\in \mathcal{C}_1}(y_n - m_1)^2+\sum_{n\in \mathcal{C}_2}(y_n - m_2)^2}\\ &=\frac{(\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2))^T(\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2))} {\sum_{n\in \mathcal{C}_1}(\boldsymbol{w}^T\boldsymbol{x}_n - \boldsymbol{w}^T\boldsymbol{m}_1)^2+\sum_{n\in \mathcal{C}_2}(\boldsymbol{w}^T\boldsymbol{x}_n - \boldsymbol{w}^T\boldsymbol{m}_2)^2}\\ &=\frac{\boldsymbol{w}^T(\boldsymbol{m}_1-\boldsymbol{m}_2)(\boldsymbol{m}_1-\boldsymbol{m}_2)^T\boldsymbol{w}} {\boldsymbol{w}^T(\sum_{n\in \mathcal{C}_1}(\boldsymbol{x}_n - \boldsymbol{m}_1)(\boldsymbol{x}_n - \boldsymbol{m}_1)^T+\sum_{n\in \mathcal{C}_2}(\boldsymbol{x}_n - \boldsymbol{m}_2)(\boldsymbol{x}_n - \boldsymbol{m}_2)^T)\boldsymbol{w}} \end{aligned}\tag{8}

And we can set:

\begin{aligned} S_B &= (\boldsymbol{m}_1-\boldsymbol{m}_2)(\boldsymbol{m}_1-\boldsymbol{m}_2)^T\\ S_W &= \sum_{n\in \mathcal{C}_1}(\boldsymbol{x}_n - \boldsymbol{m}_1)(\boldsymbol{x}_n - \boldsymbol{m}_1)^T+\sum_{n\in \mathcal{C}_2}(\boldsymbol{x}_n - \boldsymbol{m}_2)(\boldsymbol{x}_n - \boldsymbol{m}_2)^T \end{aligned}\tag{9}

where $S_B$ represent the covariance matrix Between classes, and $S_W$ represent the Within classes covariance. Then the equation(8) becomes:

\begin{aligned} J(\boldsymbol{w})&=\frac{\boldsymbol{w}^TS_B\boldsymbol{w}}{\boldsymbol{w}^TS_W\boldsymbol{w}} \end{aligned}\tag{10}

To maximise the $J(\boldsymbol{w})$, we should differentiat equation (10) with respect to $\boldsymbol{w}$ firstly:

\begin{aligned} \frac{\partial }{\partial \boldsymbol{w}}J(\boldsymbol{w})&=\frac{\partial }{\partial \boldsymbol{w}}\frac{\boldsymbol{w}^TS_B\boldsymbol{w}}{\boldsymbol{w}^TS_W\boldsymbol{w}}\\ &=\frac{(S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})-(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}}{(\boldsymbol{w}^TS_W\boldsymbol{w})^T(\boldsymbol{w}^TS_W\boldsymbol{w})} \end{aligned}\tag{11}

and set it to zero:

$\frac{(S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})-(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}}{(\boldsymbol{w}^TS_W\boldsymbol{w})^T(\boldsymbol{w}^TS_W\boldsymbol{w})}=0\tag{12}$

and then:

\begin{aligned} (S_B+S_B^T)\boldsymbol{w}(\boldsymbol{w}^TS_W\boldsymbol{w})&=(\boldsymbol{w}^TS_B\boldsymbol{w})(S_W+S_W^T)\boldsymbol{w}\\ (\boldsymbol{w}^TS_W\boldsymbol{w})S_B\boldsymbol{w}&=(\boldsymbol{w}^TS_B\boldsymbol{w})S_W\boldsymbol{w} \end{aligned}\tag{13}

Because $(\boldsymbol{w}^TS_W\boldsymbol{w})$ and $(\boldsymbol{w}^TS_B\boldsymbol{w})$ are scalars and according equation (9) when we multiply both sides by $\boldsymbol{w}$ and we have

$S_B \boldsymbol{w}= (\boldsymbol{m}_1-\boldsymbol{m}_2)((\boldsymbol{m}_1-\boldsymbol{m}_2)^T\boldsymbol{w})\tag{14}$

and $(\boldsymbol{m}_1-\boldsymbol{m}_2)^T\boldsymbol{w}$ is a scalar and $S_B\boldsymbol{w}$ have the same direction with $(\boldsymbol{m}_1-\boldsymbol{m}_2)$

so the equation (13) can be written as:

\begin{aligned} \boldsymbol{w}&\propto S^{-1}_WS_B\boldsymbol{w}\\ \boldsymbol{w}&\propto S^{-1}_W(\boldsymbol{m}_1-\boldsymbol{m}_2) \end{aligned}\tag{15}

So, up to now, we have had a result parameter vector $\boldsymbol{w}$ based on maximizing Fisher Criterion.

## Code

The code of the process is relatively easy:

The entire project can be found https://github.com/Tony-Tan/ML and please star me.

The results of the code(where the line is the one to which the points are projected):

## References

1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006. ↩︎