February 03, 2019

If you are new to data science, first go through the introduction to Data Mining.

Fisher's Linear Discriminant is a linear supervised classifier. Supervised classification means that labeled training examples are available for learning the classifier.

First we go through the derivation of the concept, and then we implement it in Python.

Fisher's Linear Discriminant maps d-dimensional data to one dimension using a projection vector W: it maps a vector $X$ to the scalar $W^TX$. Classification is then carried out in this one-dimensional space.

i.e., for a two-class problem the classifier is

Decide $X \in C_1 \text{ if } W^TX + w_0 > 0 \text{ and } X \in C_0 \text{ if } W^TX + w_0 < 0$
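As a minimal sketch of this decision rule (the projection vector W and threshold w0 below are made-up toy values, not learned ones; the rest of the post shows how to actually find W):

```python
import numpy as np

# Toy values for illustration only; the derivation shows how to find W.
W = np.array([1.0, -1.0])
w0 = 0.0

def classify(X, W, w0):
    """Decide class 1 if W^T X + w0 > 0, else class 0."""
    return 1 if W @ X + w0 > 0 else 0
```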
Hence one can think of the best W as the direction along which the two classes are best separated.

We project the data along the direction of W. The separation between points of different classes in the projected data is a good way to rate how good W is: when the projection direction is well chosen, the separation between the two classes is large and we can easily proceed with the classification.

So now our task is to find the axis W that gives the optimum separation between the data points of the two classes. Once we have found W, we project the data along that axis and find the separating hyperplane f(x).

Fisher's Linear Discriminant is a formal way to find this optimum separation direction. Let $M_0$ and $M_1$ be the means of the data from the two classes $C_0$ and $C_1$ respectively.

$M_0 = \frac{1}{n_0}\sum_{X_i \in C_0} X_i \qquad \text{and} \qquad M_1 = \frac{1}{n_1}\sum_{X_i \in C_1} X_i$

The corresponding means of the projected data are

$m_0 = W^TM_0 \qquad \text{and} \qquad m_1 = W^TM_1$
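For concreteness, here is a small NumPy sketch of these quantities; the two-class data and the direction W are made-up toy values:

```python
import numpy as np

# Toy two-class data, rows are samples (illustrative values).
C0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
C1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

M0 = C0.mean(axis=0)      # class means in the original d-dimensional space
M1 = C1.mean(axis=0)

W = np.array([1.0, 1.0])  # an arbitrary projection direction for illustration
m0 = W @ M0               # means of the projected (scalar) data
m1 = W @ M1
```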

The difference $(m_1 - m_0)$ gives us an idea of the separation between samples of the two classes after projecting the data onto the direction W.
Hence we want a W that maximizes $(m_1 - m_0)^2$. But this quantity can be made arbitrarily large simply by scaling up the magnitude of W, which is of no use. We must make the criterion independent of the scale of W by measuring the separation relative to the variances of the projected classes.

So we can define Fisher's criterion function, which we seek to maximize, as

$J(W) = \frac{(m_1 - m_0)^2}{s_0^2 + s_1^2}$

where $s_0^2$ and $s_1^2$ are proportional to the variances of the projections of class 0 and class 1 respectively.
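This criterion is easy to evaluate numerically for any candidate direction; here is a sketch with toy two-class data (all values illustrative):

```python
import numpy as np

def J(W, C0, C1):
    """Fisher criterion (m1 - m0)^2 / (s0^2 + s1^2) for a direction W."""
    p0, p1 = C0 @ W, C1 @ W        # project each class onto W
    m0, m1 = p0.mean(), p1.mean()
    s0 = np.sum((p0 - m0) ** 2)    # scatter of the projected class 0
    s1 = np.sum((p1 - m1) ** 2)    # scatter of the projected class 1
    return (m1 - m0) ** 2 / (s0 + s1)

# Toy two-class data, rows are samples (illustrative values).
C0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
C1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
```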

We now write J in a more convenient form.

\begin{align*} \qquad (m_1 - m_0)^2 &= (W^TM_1 - W^TM_0)^2 \\ &= W^T(M_1 - M_0)(M_1 - M_0)^TW \\ &= W^TS_BW \\ &\text{where } S_B = (M_1 - M_0)(M_1 - M_0)^T \end{align*}

This $S_B$ is a $d \times d$ matrix of rank 1. It is called the "between-class scatter matrix".
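Since $S_B$ is just an outer product, it can be computed in one line; a sketch with assumed class means (illustrative values):

```python
import numpy as np

# Assumed class means for illustration.
M0 = np.array([2.0, 2.0])
M1 = np.array([6.0, 5.0])

d = M1 - M0
SB = np.outer(d, d)  # between-class scatter matrix: d x d, symmetric, rank 1
```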

Similarly we can write $s_0^2$ and $s_1^2$ as quadratic forms:


\begin{align*} \qquad s_0^2 &= W^T\Big[\sum_{X_i \in C_0}(X_i - M_0)(X_i - M_0)^T\Big]W \\ s_1^2 &= W^T\Big[\sum_{X_i \in C_1}(X_i - M_1)(X_i - M_1)^T\Big]W \end{align*}

and we can write

\begin{align*} \qquad s_0^2 + s_1^2 &= W^TS_WW \\ &\text{where } S_W = \sum_{X_i \in C_0}(X_i - M_0)(X_i - M_0)^T + \sum_{X_i \in C_1}(X_i - M_1)(X_i - M_1)^T \end{align*}

This $S_W$ is also a $d \times d$ matrix. It is called the "within-class scatter matrix".
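The within-class scatter matrix can be computed per class and summed; a sketch on toy data (illustrative values):

```python
import numpy as np

# Toy two-class data, rows are samples (illustrative values).
C0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
C1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

def scatter(X):
    """Sum of (Xi - M)(Xi - M)^T over the samples of one class."""
    D = X - X.mean(axis=0)
    return D.T @ D

SW = scatter(C0) + scatter(C1)  # within-class scatter matrix, d x d
```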

Thus we can write $J(W) = (W^TS_BW) / (W^TS_WW)$, which is a ratio of two quadratic forms. We can find a maximum by differentiating with respect to W and equating to zero, which gives

$\qquad \frac{2S_BW}{W^TS_WW} - \frac{W^TS_BW}{(W^TS_WW)^2}\,2S_WW = 0$

Here $W^TS_WW$ and $W^TS_BW / (W^TS_WW)^2$ are scalars, so we can write the condition as

$S_WW = \lambda S_BW$

Thus any maximizer of J(W) has to satisfy $S_WW = \lambda S_BW$ for some constant $\lambda$. This is a generalized eigenvalue problem.

\begin{align*} W &= \lambda S_W^{-1}S_BW \\ &\text{where } S_BW = (M_1 - M_0)(M_1 - M_0)^TW = k(M_1 - M_0) \\ &\because (M_1 - M_0)^TW \; \text{is a scalar constant } k \end{align*}

Since we are interested only in the direction of W, we can neglect the scalar constants.

$\Rightarrow W = S_W^{-1}(M_1 - M_0)$

So finally we have found the direction vector W that maximizes the separation between the projections of the two classes of data.
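Putting the pieces together, here is a minimal end-to-end sketch on toy data (all values illustrative); `np.linalg.solve` is used instead of explicitly inverting $S_W$, which is the numerically preferable way to apply $S_W^{-1}$:

```python
import numpy as np

# Toy two-class data, rows are samples (illustrative values).
C0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
C1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

M0, M1 = C0.mean(axis=0), C1.mean(axis=0)

def scatter(X):
    """Sum of (Xi - M)(Xi - M)^T over the samples of one class."""
    D = X - X.mean(axis=0)
    return D.T @ D

SW = scatter(C0) + scatter(C1)

# W = SW^{-1} (M1 - M0), computed by solving SW W = (M1 - M0).
W = np.linalg.solve(SW, M1 - M0)

# Projected class means along the Fisher direction.
m0, m1 = W @ M0, W @ M1
```

On this toy data the projections of the two classes do not overlap along the resulting W, so a threshold between the projected means separates them.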
