February 03, 2019

For the underlying theory of Fisher's LDA, see my previous post, Fisher's Linear Discriminant.

As an example, consider six points: (2,2), (4,3), and (5,1) of Class 1, and (1,3), (5,5), and (3,6) of Class 2. Copy the following data into data.txt in your working folder:

2,2,1

1,3,2

4,3,1

5,1,1

5,5,2

3,6,2

Here the last column indicates the class label (1 or 2). We can load this file in Python using the built-in open() function.

Step 1: Loading Data File

Create a Python file called LDA.py and add the following code to it.

 

import numpy as np

def LoadFile(filename):        # load the input file containing training data
    dataset = []
    with open(filename, 'r') as lines:    # text mode; file is closed automatically
        for line in lines:
            line = line.strip().split(',')
            dataset.append(line)
    dataset = np.array(dataset).astype(np.float64)
    return dataset

filename = 'data.txt'
dataset = LoadFile(filename)

 

We can check this function by printing the dataset: add "print dataset" at the end of the file and run it; you should get the result as

 

[[ 2.  2.  1.]
 [ 1.  3.  2.]
 [ 4.  3.  1.]
 [ 5.  1.  1.]
 [ 5.  5.  2.]
 [ 3.  6.  2.]]
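As a side note, NumPy can do this parsing in a single call. Here is a minimal sketch (written in Python 3 syntax) assuming the same comma-separated layout as data.txt; the file is recreated inline so the snippet runs standalone:

```python
import numpy as np

# recreate the sample data.txt from the example above
with open('data.txt', 'w') as f:
    f.write('2,2,1\n1,3,2\n4,3,1\n5,1,1\n5,5,2\n3,6,2\n')

# parse the comma-separated file directly into a float array
dataset = np.genfromtxt('data.txt', delimiter=',')
print(dataset.shape)   # (6, 3)
```

Either loader yields the same 6x3 float array.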

 

Step 2: Separating Data by Class

In this step we separate the data by class. Add the following code to the above file immediately after the LoadFile() function.

 

def ByClass(dataset):        #separate data by class
    classes = np.unique(dataset[:,-1])
    div_class = {}
    for i in classes:
        div_class[i] = dataset[dataset[:,-1] == i]
    return div_class

 

We can check this by printing the returned value, i.e., add the following code at the end of LDA.py and run it.

 

div_data = ByClass(dataset)
print div_data

This should give the result as
{1.0: array([[ 2.,  2.,  1.],
       [ 4.,  3.,  1.],
       [ 5.,  1.,  1.]]), 
2.0: array([[ 1.,  3.,  2.],
       [ 5.,  5.,  2.],
       [ 3.,  6.,  2.]])}

 

Step 3: Finding the Mean

Create a function called Mean() that calculates the attribute-wise mean of the given data as follows.

def Mean(data):
    mean = data.mean(axis = 0)
    return mean

 

Here we are using NumPy's mean function to calculate the column-wise means of the data. Test it by adding the following code to the end of the file.

# drop the last column, which contains only the class labels
mean = Mean(dataset[:,:-1])
print mean

The result would be...

[ 3.33333333  3.33333333]

 

Step 4: Finding the Optimal Direction Vector W

This is the central task in LDA. We first compute the within-class scatter matrix (σw) and then obtain the direction vector W as

W = \sigma_w^{-1} (M_1 - M_2)

where

\sigma_w = \sum_{X_i \in C_1}(X_i - M_1)(X_i - M_1)^T + \sum_{X_i \in C_2}(X_i - M_2)(X_i - M_2)^T
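To make the formula concrete, here is a small independent check (Python 3 syntax) that evaluates σw and W directly for the six example points; the scatter() helper is an ad hoc name for this sketch, not part of the tutorial code:

```python
import numpy as np

# the example points, split by class (labels dropped)
c1 = np.array([[2., 2.], [4., 3.], [5., 1.]])
c2 = np.array([[1., 3.], [5., 5.], [3., 6.]])

def scatter(X):
    # sum over samples of (Xi - M)(Xi - M)^T, i.e. D^T D for mean-centred D
    D = X - X.mean(axis=0)
    return D.T.dot(D)

Sw = scatter(c1) + scatter(c2)                        # within-class scatter matrix
W = np.linalg.solve(Sw, c1.mean(axis=0) - c2.mean(axis=0))   # Sw^-1 (M1 - M2)
print(Sw)
print(W)
```

Solving the linear system with np.linalg.solve avoids forming the explicit inverse; the result matches the tutorial's output below.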

 

def main(dataset):        # assuming a two-class problem
    div_data = ByClass(dataset)
    class1, class2 = sorted(div_data)            # class labels in a fixed order
    class1_data, class2_data = div_data[class1], div_data[class2]
    class1_data = class1_data[:,:-1]             # drop the class-label column
    class2_data = class2_data[:,:-1]
    mean1 = Mean(class1_data)
    mean2 = Mean(class2_data)

    diff1 = class1_data - mean1                  # mean-centred class 1 data
    diff2 = class2_data - mean2                  # mean-centred class 2 data
    diff = np.concatenate([diff1, diff2])
    m, n = diff.shape
    withinClass = np.zeros((n,n))
    diff = np.matrix(diff)
    for i in xrange(m):
        withinClass += np.dot(diff[i,:].T, diff[i,:])

    print withinClass

    opt_dir_vector = np.dot(np.linalg.inv(withinClass), (mean1 - mean2))
    print 'Vector(W) = ', np.matrix(opt_dir_vector).T

 

To test it, add the following code at the bottom of the file and run it:

filename = 'data.txt'
dataset = LoadFile(filename)
main(dataset)

 

You should get the 2x2 within-class scatter matrix and the direction vector as

[[ 12.66666667   3.        ]
 [  3.           6.66666667]]

Vector(W) =  [[ 0.16494845]
 [-0.4742268 ]]

 

We have now calculated the optimal direction vector W, along which the separation between the two classes is maximum. Next we have to define a threshold to classify any given pattern.

For a two-class problem the classifier is:

 

Decide  X \in C_1 \; if \; W^TX - w_1 > 0  or  X \in C_2 \; if \; W^TX - w_1 < 0

 

Here w1 is the threshold value we need to calculate. A simple choice for this threshold is the average of the two class means projected onto the W axis.

That is, if m1 is the projected mean of class 1 and m2 is the projected mean of class 2, then

 

w_1 = (m_1 + m_2) / 2
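As a sanity check on this formula, the projected means and the threshold for the example data can be computed directly (Python 3 syntax); the W values below are taken from the previous step's output:

```python
import numpy as np

c1 = np.array([[2., 2.], [4., 3.], [5., 1.]])   # Class 1 points
c2 = np.array([[1., 3.], [5., 5.], [3., 6.]])   # Class 2 points
W = np.array([0.16494845, -0.4742268])          # direction vector from Step 4

m1 = c1.dot(W).mean()     # projected mean of class 1
m2 = c2.dot(W).mean()     # projected mean of class 2
w1 = (m1 + m2) / 2        # threshold: midpoint of the projected means
print(m1, m2, w1)
```

The printed values should match the m1, m2, and Threshold shown in the final output below.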

 

To compute it, create a function called Threshold():

def Threshold(vector, data1, data2):
    mu1 = Mean(np.dot(vector, data1.T))
    mu2 = Mean(np.dot(vector, data2.T))
    print mu1, mu2
    return (mu1+mu2)/2

 

and add the following lines of code to the main() function:

threshold = Threshold(opt_dir_vector, class1_data, class2_data)
print 'Threshold = ', threshold

 

With that, training the classifier is complete. The combined code is as follows.

 

Step 5: Joining All the Code Blocks

import numpy as np

def LoadFile(filename):        # load the input file containing training data
    dataset = []
    with open(filename, 'r') as lines:    # text mode; file is closed automatically
        for line in lines:
            line = line.strip().split(',')
            dataset.append(line)
    dataset = np.array(dataset).astype(np.float64)
    return dataset

def ByClass(dataset):        #separate data by class
    classes = np.unique(dataset[:,-1])
    div_class = {}
    for i in classes:
        div_class[i] = dataset[dataset[:,-1] == i]
    return div_class

def Mean(data):
    mean = data.mean(axis = 0)
    return mean

def Threshold(vector, data1, data2):
    mu1 = Mean(np.dot(vector, data1.T))    # projected mean of class 1
    mu2 = Mean(np.dot(vector, data2.T))    # projected mean of class 2
    return (mu1+mu2)/2, mu1, mu2
    

def main(dataset):        # assuming a two-class problem
    div_data = ByClass(dataset)
    class1, class2 = sorted(div_data)            # class labels in a fixed order
    class1_data, class2_data = div_data[class1], div_data[class2]
    class1_data = class1_data[:,:-1]             # drop the class-label column
    class2_data = class2_data[:,:-1]
    mean1 = Mean(class1_data)
    mean2 = Mean(class2_data)

    diff1 = class1_data - mean1                  # mean-centred class 1 data
    diff2 = class2_data - mean2                  # mean-centred class 2 data
    diff = np.concatenate([diff1, diff2])
    m, n = diff.shape
    withinClass = np.zeros((n,n))
    diff = np.matrix(diff)
    for i in xrange(m):
        withinClass += np.dot(diff[i,:].T, diff[i,:])
    opt_dir_vector = np.dot(np.linalg.inv(withinClass), (mean1 - mean2))
    print 'Vector = ', np.matrix(opt_dir_vector).T

    threshold, mu1, mu2 = Threshold(opt_dir_vector, class1_data, class2_data)
    print 'Threshold = ', threshold, 'm1 = ', mu1, 'm2 = ', mu2
    
    
if __name__ == '__main__':    
    filename = 'data.txt'
    dataset = LoadFile(filename)
    main(dataset)

 

The final output should be

Vector =  [[ 0.16494845]
 [-0.4742268 ]]
Threshold =  -1.03092783505 m1 =  -0.343642611684 m2 =  -1.71821305842

 

So, given a test pattern X:

  1. If W^TX > Threshold, then X is classified as Class 1.
  2. If W^TX < Threshold, then X is classified as Class 2.
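Putting the rule into code, here is a minimal classifier sketch (Python 3 syntax); the classify() helper is a hypothetical name, and W and the threshold are the values trained above:

```python
import numpy as np

W = np.array([0.16494845, -0.4742268])   # trained direction vector
threshold = -1.03092783505               # trained threshold

def classify(x):
    # project x onto W and compare against the threshold
    return 1 if np.dot(W, x) > threshold else 2

print(classify([2, 2]))   # 1  (a Class 1 training point)
print(classify([3, 6]))   # 2  (a Class 2 training point)
```

As a quick check, every one of the six training points is classified back into its own class by this rule.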

 

You can use this code for any two-class numerical problem.

The Python code used in the above post can be downloaded from Github

 

