For the theory behind Fisher's LDA, see my previous post, Fisher's Linear Discriminant.

As an example, consider six points: (2,2), (4,3) and (5,1) from Class 1, and (1,3), (5,5) and (3,6) from Class 2. Copy the following data into data.txt in your working folder, one record per line:

2,2,1

1,3,2

4,3,1

5,1,1

5,5,2

3,6,2

Here the last column indicates the class label (1 or 2). We can load this file in Python using the built-in open() function.

### Step 1: Loading Data File

Create a Python file called LDA.py and add the following code to it.

```
import numpy as np

def LoadFile(filename):  # load input file containing training data
    dataset = []
    with open(filename) as lines:  # text mode, so each line is a str
        for line in lines:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            dataset.append(line.split(','))
    return np.array(dataset).astype(np.float64)

filename = 'data.txt'
dataset = LoadFile(filename)
```

To check this function, add `print(dataset)` at the end of the file and run it. You should get:

```
[[ 2. 2. 1.]
 [ 1. 3. 2.]
 [ 4. 3. 1.]
 [ 5. 1. 1.]
 [ 5. 5. 2.]
 [ 3. 6. 2.]]
```
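As an aside, NumPy can parse a comma-separated file directly. The sketch below is an alternative to LoadFile(), not the loader used in the rest of this post. It writes the six sample rows to a temporary file (a stand-in for data.txt) so that it runs on its own; the names `rows`, `path` and `loaded` are only illustrative.

```python
import os
import tempfile

import numpy as np

# The six sample rows from above; the temporary file stands in for data.txt.
rows = "2,2,1\n1,3,2\n4,3,1\n5,1,1\n5,5,2\n3,6,2\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(rows)
    path = handle.name

# np.loadtxt splits each line on the delimiter and converts the fields to floats.
loaded = np.loadtxt(path, delimiter=",")
os.remove(path)

print(loaded.shape)  # (6, 3)
```

The hand-written loader is still worth keeping in a tutorial like this one, since it shows each parsing step explicitly.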

### Step 2: Summarizing Data

In this step we separate the data by class. Add the following code to the above file, right after the LoadFile() function.

```
def ByClass(dataset):  # separate data by class
    classes = np.unique(dataset[:, -1])  # class labels are in the last column
    div_class = {}
    for i in classes:
        div_class[i] = dataset[dataset[:, -1] == i]  # rows whose label equals i
    return div_class
```

We can check this by printing the returned dictionary. Add the following code at the end of LDA.py and run it.

```
div_data = ByClass(dataset)
print(div_data)
```

This should give the result as

```
{1.0: array([[ 2., 2., 1.],
       [ 4., 3., 1.],
       [ 5., 1., 1.]]),
 2.0: array([[ 1., 3., 2.],
       [ 5., 5., 2.],
       [ 3., 6., 2.]])}
```

### Step 3: Finding Mean

Create a function called Mean() that calculates the attribute-wise mean of the given data, as follows.

```
def Mean(data):
    mean = data.mean(axis=0)  # average over rows, giving one value per column
    return mean
```

Here we use NumPy's mean function with axis=0 to calculate the column-wise means of the data. Test it by adding the following code to the end of the file.

```
# remove the last column, as it holds only the class labels
mean = Mean(dataset[:, :-1])
print(mean)
```

The result would be...

`[ 3.33333333 3.33333333]`
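The direction vector in the next step is built from the per-class means rather than the overall mean. As a rough, self-contained sketch (the array is typed in by hand instead of being read from data.txt, and `features`, `labels` and `class_means` are illustrative names), the same `mean(axis=0)` call can be applied to each class separately:

```python
import numpy as np

data = np.array([[2, 2, 1], [1, 3, 2], [4, 3, 1],
                 [5, 1, 1], [5, 5, 2], [3, 6, 2]], dtype=np.float64)
features, labels = data[:, :-1], data[:, -1]

# Boolean indexing keeps only the rows that belong to the current label.
class_means = {}
for label in np.unique(labels):
    class_means[int(label)] = features[labels == label].mean(axis=0)

print(class_means[1])
print(class_means[2])
```

For this data the Class 1 mean is (11/3, 2) and the Class 2 mean is (3, 14/3).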

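### Step 4: Finding the Optimum Direction Vector W

This is the main task in LDA. We first find the within-class scatter matrix ($\sigma_w$) and then find the direction vector W as

$$W = \sigma_w^{-1} (\mu_1 - \mu_2)$$

where $\mu_1$ and $\mu_2$ are the mean vectors of Class 1 and Class 2, and

$$\sigma_w = \sum_{c=1}^{2} \sum_{x \in \text{Class } c} (x - \mu_c)(x - \mu_c)^T$$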

```
def main(dataset):  # assuming a two-class problem
    div_data = ByClass(dataset)
    class1, class2 = sorted(div_data)  # class labels in ascending order
    class1_data, class2_data = div_data[class1], div_data[class2]
    class1_data = class1_data[:, :-1]  # drop the class-label column
    class2_data = class2_data[:, :-1]
    mean1 = Mean(class1_data)
    mean2 = Mean(class2_data)
    # subtract each class mean from its own samples (broadcast over the rows)
    diff1 = class1_data - mean1
    diff2 = class2_data - mean2
    diff = np.concatenate([diff1, diff2])
    m, n = diff.shape
    withinClass = np.zeros((n, n))
    for i in range(m):  # sum the outer products of the centered samples
        withinClass += np.outer(diff[i], diff[i])
    print(withinClass)
    opt_dir_vector = np.dot(np.linalg.inv(withinClass), mean1 - mean2)
    print('Vector(W) = ', opt_dir_vector.reshape(-1, 1))
```

To test it, add the following code at the bottom of the file and run it:

```
filename = 'data.txt'
dataset = LoadFile(filename)
main(dataset)
```

You should get the 2x2 within-class scatter matrix, followed by the direction vector W:

```
[[ 12.66666667 3. ]
 [ 3. 6.66666667]]
Vector(W) = [[ 0.16494845]
 [-0.4742268 ]]
```
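The loop over rows can also be written with matrix products. The following is a sketch of that alternative, not a replacement for main(): it types the data in by hand, forms the scatter matrix as `diff.T @ diff`, and uses `np.linalg.solve` in place of an explicit inverse. The variable names are illustrative.

```python
import numpy as np

class1 = np.array([[2, 2], [4, 3], [5, 1]], dtype=np.float64)
class2 = np.array([[1, 3], [5, 5], [3, 6]], dtype=np.float64)

mean1, mean2 = class1.mean(axis=0), class2.mean(axis=0)

# Center each class on its own mean, then stack the centered rows.
diff = np.vstack([class1 - mean1, class2 - mean2])

# diff.T @ diff equals the sum of the outer products of the rows of diff.
within_class = diff.T @ diff

# Solve within_class @ w = mean1 - mean2 rather than inverting the matrix.
w = np.linalg.solve(within_class, mean1 - mean2)

print(within_class)
print(w)
```

Solving the linear system gives the same vector as multiplying by the inverse, and it is generally the numerically safer habit when the scatter matrix is close to singular.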

We have now calculated the optimum direction vector W, the direction along which the separation between the two classes is maximum. Next we have to define a threshold point in order to classify any given pattern.

That is, for a two-class problem the classifier is:

Decide Class 1 if $W^T X > w_1$, or Class 2 if $W^T X < w_1$.

Here $w_1$ is the threshold value, which we still have to calculate. A simple choice for this threshold is the average of the projected means of the two classes on the W axis.

That is, if $m_1$ is the projected mean of Class 1 and $m_2$ is the projected mean of Class 2, then

$$w_1 = \frac{m_1 + m_2}{2}$$

For that, create a function called Threshold() as

```
def Threshold(vector, data1, data2):
    mu1 = Mean(np.dot(vector, data1.T))  # mean of Class 1 projected onto W
    mu2 = Mean(np.dot(vector, data2.T))  # mean of Class 2 projected onto W
    print(mu1, mu2)
    return (mu1 + mu2) / 2
```

and add the following lines of code to the end of the main() function:

```
    threshold = Threshold(opt_dir_vector, class1_data, class2_data)
    print('Threshold = ', threshold)
```
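Because projection is linear, the mean of the projected points equals the projection of the class mean, so the threshold can be cross-checked from the class means alone. Here is a small sketch of that check, with W rounded to the values printed in Step 4 and the class means worked out by hand from the six sample points (the variable names are illustrative):

```python
import numpy as np

w = np.array([0.16494845, -0.4742268])  # direction vector printed in Step 4
mean1 = np.array([11 / 3, 2.0])         # mean of (2,2), (4,3), (5,1)
mean2 = np.array([3.0, 14 / 3])         # mean of (1,3), (5,5), (3,6)

m1 = np.dot(w, mean1)  # projected mean of Class 1
m2 = np.dot(w, mean2)  # projected mean of Class 2
threshold = (m1 + m2) / 2

print(m1, m2, threshold)
```

These work out to roughly -0.3436, -1.7182 and -1.0309 respectively.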

So, we have completed training the classifier. The complete code is as follows.

### Step 5: Joining All the Code Blocks

```
import numpy as np

def LoadFile(filename):  # load input file containing training data
    dataset = []
    with open(filename) as lines:  # text mode, so each line is a str
        for line in lines:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            dataset.append(line.split(','))
    return np.array(dataset).astype(np.float64)

def ByClass(dataset):  # separate data by class
    classes = np.unique(dataset[:, -1])  # class labels are in the last column
    div_class = {}
    for i in classes:
        div_class[i] = dataset[dataset[:, -1] == i]
    return div_class

def Mean(data):
    mean = data.mean(axis=0)
    return mean

def Threshold(vector, data1, data2):
    mu1 = Mean(np.dot(vector, data1.T))  # mean of Class 1 projected onto W
    mu2 = Mean(np.dot(vector, data2.T))  # mean of Class 2 projected onto W
    return (mu1 + mu2) / 2, mu1, mu2

def main(dataset):  # assuming a two-class problem
    div_data = ByClass(dataset)
    class1, class2 = sorted(div_data)  # class labels in ascending order
    class1_data, class2_data = div_data[class1], div_data[class2]
    class1_data = class1_data[:, :-1]  # drop the class labels; they are not needed for the means
    class2_data = class2_data[:, :-1]
    mean1 = Mean(class1_data)
    mean2 = Mean(class2_data)
    # subtract each class mean from its own samples (broadcast over the rows)
    diff1 = class1_data - mean1
    diff2 = class2_data - mean2
    diff = np.concatenate([diff1, diff2])
    m, n = diff.shape
    withinClass = np.zeros((n, n))
    for i in range(m):  # sum the outer products of the centered samples
        withinClass += np.outer(diff[i], diff[i])
    opt_dir_vector = np.dot(np.linalg.inv(withinClass), mean1 - mean2)
    print('Vector = ', opt_dir_vector.reshape(-1, 1))
    # Threshold() returns the threshold together with both projected means
    threshold, mu1, mu2 = Threshold(opt_dir_vector, class1_data, class2_data)
    print('Threshold = ', threshold, 'm1 = ', mu1, 'm2 = ', mu2)

if __name__ == '__main__':
    filename = 'data.txt'
    dataset = LoadFile(filename)
    main(dataset)
```

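The final output should be similar to the following (Python 3 prints more decimal digits for the threshold and the projected means):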

```
Vector = [[ 0.16494845]
 [-0.4742268 ]]
Threshold = -1.03092783505 m1 = -0.343642611684 m2 = -1.71821305842
```

So, given a test pattern X, if

- $W^T X >$ Threshold, then X is classified as Class 1
- $W^T X <$ Threshold, then X is classified as Class 2

You can use this code for any two-class problem with numerical attributes.

The Python code used in the above post can be downloaded from GitHub.
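As an illustration of this decision rule, here is a minimal sketch of a prediction helper. `Classify` is a hypothetical function that is not part of LDA.py; W and the threshold are copied, rounded, from the final output shown above.

```python
import numpy as np

W = np.array([0.16494845, -0.4742268])  # direction vector from the final output
THRESHOLD = -1.03092783505              # threshold from the final output


def Classify(x, w=W, threshold=THRESHOLD):
    """Return 1 if the projection of x onto w exceeds the threshold, else 2."""
    projection = np.dot(w, np.asarray(x, dtype=np.float64))
    return 1 if projection > threshold else 2


# Run the six training points back through the classifier.
for point in [(2, 2), (4, 3), (5, 1), (1, 3), (5, 5), (3, 6)]:
    print(point, '->', Classify(point))
```

All six training points come back with their own class label. A pattern that lands exactly on the threshold is sent to Class 2 here, which is an arbitrary choice made for this sketch.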
