June 01, 2020
[Artificial Neural Network] Handwritten Digit Recognition using Scikit-Learn
  1. Introduction
  2. Data Preparation
    • 2.1 Load Data
    • 2.2 Check for missing values
    • 2.3 Feature Scaling
    • 2.4 Split into train and evaluation data
  3. Multi Layer Perceptron
  4. Evaluate the model
    • 4.1 Confusion Matrix
  5. Submission Score

 

1. Introduction

Over the past two decades Artificial Intelligence has developed rapidly, approaching and in some tasks even surpassing human-level capability. Though the concept of the artificial neural network is old, it has gained popularity and has been used widely in recent years. Here we'll implement a simple Multi Layer Perceptron (MLP), one of the most basic kinds of neural network, to classify images of handwritten digits. This classification problem is taken from Kaggle.

 

2. Data Preparation

Go to this URL, download the data, and extract it (if needed) into your project's working directory.

# Import all the necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, plot_confusion_matrix

from sklearn.neural_network import MLPClassifier

2.1 Load Data

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

2.2 Check for missing values

train.isnull().any().describe()
count       785
unique        1
top       False
freq        785
dtype: object

The check runs over all 785 columns. There is only one unique value (indicated by 'unique = 1'), and it is 'False', which shows that none of the columns contain missing values.
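As a quicker sanity check (a small sketch on the same DataFrame), we could also print the total number of missing cells directly:

# Alternative check: total number of missing cells in the whole DataFrame
print(train.isnull().sum().sum())   # 0 means there are no missing values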

Sometimes we end up with imbalanced training data, i.e. data that is heavily biased towards one or a few classes. Usually this doesn't hurt the classifier as long as every class still has enough training examples. For example, if a training set of 100 images has 90 images of label '1', the remaining 9 labels (the minority classes) share only 10 images, roughly one each. A model trained on this data might recognize the '1' images accurately but fail to classify images of the other 9 labels.
   
Plot the label distribution to find out whether any frequency bias is present in the data.

fig = plt.figure(figsize=(15,4)) # 15 width, 4 height in inches
a = fig.add_subplot(1, 2, 1)
train['label'].value_counts().plot(kind='bar')
a.set_title('Training data distribution.', y=-0.2)
plt.show()

There is a little imbalance in the distribution, but every label has a fairly decent number of training images.
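To put numbers on that claim (a small sketch, run before the labels are separated below), we can print the fraction of the training set that each digit occupies:

# Fraction of the training data belonging to each digit
print(train['label'].value_counts(normalize=True).round(3))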

# Separate the data and labels, convert into numpy arrays
tr_label = train[['label']].T.values
del train['label']
train = train.to_numpy()

test = test.to_numpy()

2.3 Feature Scaling

Feature scaling, or normalization, is an important preprocessing step for machine learning problems. The pixel values range from 0 to 255, so dividing by 255 maps them to the [0, 1] range, which helps gradient-based training converge faster.

# Normalizing the values
train = train / 255.0
test = test / 255.0
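As an aside, scikit-learn's MinMaxScaler offers a related way to scale the raw pixel arrays (shown here only as a sketch; it is not used in this post). Note it is not identical to dividing by 255: it rescales each pixel column by the minimum and maximum observed in the training data rather than by the known pixel range.

# Sketch only: per-column min-max scaling of the raw (unscaled) pixel arrays
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_alt = scaler.fit_transform(train)  # fit the scaler on training pixels only
test_alt = scaler.transform(test)        # apply the same scaling to the test set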

2.4 Split into train and evaluation data

# stratify=y maintains the same distribution of labels in both train and evaluation splits
x_train, x_eval, y_train, y_eval = train_test_split(
    train, tr_label[0], test_size=0.10, random_state=4, stratify=tr_label[0])

Here I split the training data into two sets: 90% for training and 10% for evaluation. To keep the same label distribution in both sets, I've used Scikit-learn's 'train_test_split' with 'stratify' enabled.

 

# Plot train and evaluation dataset distribution
fig = plt.figure(figsize=(15,4))
a = fig.add_subplot(1, 2, 1)
pd.DataFrame({'label':y_train})['label'].value_counts().plot(kind='bar')
a.set_title("Train data distribution", y=-0.2)

a = fig.add_subplot(1, 2, 2)
pd.DataFrame({'label':y_eval})['label'].value_counts().plot(kind='bar')
a.set_title("Evaluation data distribution", y=-0.2)

plt.show()

Data Distribution after split

Here we can see that the train and evaluation sets have very similar distributions.
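Beyond eyeballing the plots, we can compare the two splits numerically. A short sketch using the arrays from the split above:

# Proportion of each digit in the train and evaluation splits
tr_dist = pd.Series(y_train).value_counts(normalize=True).sort_index()
ev_dist = pd.Series(y_eval).value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': tr_dist, 'eval': ev_dist}).round(3))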

 

3. Multi Layer Perceptron (MLP)

A single-layer neural network can only separate data that is linearly separable; its decision boundary is a straight line (more generally, a hyperplane). To learn the complex relationships between the pixels of an image, we need a non-linear decision boundary that can separate multiple classes accurately, and for that the network needs more than one layer.

For simplicity and quick training, I'm choosing an MLP with three layers: an input layer, one hidden layer, and an output layer. One hidden layer is sufficient for the majority of problems. Another tough question is: how large should the hidden layer be? For most problems we get decent performance with a single hidden layer whose size is the mean of the input and output layer sizes.

Here, input layer size = size of the input (28 * 28 = 784 pixels)

output layer size = number of classes (10 digits)
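As a quick sanity check of that rule of thumb (this is the number used for 'hidden_layer_sizes' below):

# Rule of thumb: hidden layer size ~= mean of the input and output layer sizes
input_size = 28 * 28                           # 784 pixels per image
output_size = 10                               # digits 0-9
hidden_size = (input_size + output_size) // 2  # = 397
print(hidden_size)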

mlp = MLPClassifier(hidden_layer_sizes=(397,), 
                    activation='logistic', 
                    alpha=1e-4,
                    solver='sgd', 
                    tol=1e-6, 
                    random_state=1, 
                    learning_rate='adaptive',
                    learning_rate_init=.1, 
                    verbose=False, max_iter=100)

mlp.fit(x_train, y_train)
/home/ml/anaconda3/envs/tf2/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:585: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
MLPClassifier(activation='logistic', hidden_layer_sizes=(397,),
              learning_rate='adaptive', learning_rate_init=0.1, max_iter=100,
              random_state=1, solver='sgd', tol=1e-06)

To complete the training quickly I've chosen to run at most 100 epochs ('max_iter=100'), which caused training to stop before convergence, as the warning above shows. If we run it for more epochs, say up to 500, we might achieve better accuracy.
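If we wanted to train longer without hand-picking the epoch count, one option (shown here only as a sketch) is to raise 'max_iter' and rely on scikit-learn's built-in early stopping, which halts training once the held-out validation score stops improving:

# Sketch: up to 500 epochs, stopping early when the validation score has not
# improved for 10 consecutive epochs
mlp_longer = MLPClassifier(hidden_layer_sizes=(397,),
                           activation='logistic',
                           alpha=1e-4,
                           solver='sgd',
                           tol=1e-6,
                           random_state=1,
                           learning_rate='adaptive',
                           learning_rate_init=.1,
                           max_iter=500,
                           early_stopping=True,       # hold out part of x_train for validation
                           validation_fraction=0.1,
                           n_iter_no_change=10)
mlp_longer.fit(x_train, y_train)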

 

4. Evaluate the model

# Loss history saved from an earlier run in which step 2.3 (normalization)
# was skipped (re-run the notebook without step 2.3 to reproduce it)
losses_no_norm = mlp.loss_curve_
# Loss history from the model trained above, with step 2.3 applied
losses = mlp.loss_curve_
fig = plt.figure(figsize=(15,4))
a = fig.add_subplot(1, 2, 1)
plt.plot(np.arange(len(losses_no_norm)), losses_no_norm)
a.set_title("Loss curve without normalization", y=-0.2)

a = fig.add_subplot(1, 2, 2)
plt.plot(np.arange(len(losses)), losses)
a.set_title("Loss Curve with normalization", y=-0.2)

plt.show()

 

From the above plots, we can clearly see that normalization helps training converge faster and more smoothly.
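For completeness, the two curves come from two separate training runs, one without and one with the normalization in step 2.3. A minimal sketch of producing both in a single script, assuming 'x_train_raw' holds a copy of the pixel values saved before dividing by 255 (a hypothetical variable, not defined earlier in this post):

# Second model trained on the unscaled pixels, for comparison only
mlp_raw = MLPClassifier(hidden_layer_sizes=(397,), activation='logistic',
                        alpha=1e-4, solver='sgd', tol=1e-6, random_state=1,
                        learning_rate='adaptive', learning_rate_init=.1,
                        max_iter=100)
mlp_raw.fit(x_train_raw, y_train)     # x_train_raw: hypothetical unscaled copy
losses_no_norm = mlp_raw.loss_curve_  # loss per epoch without normalization
losses = mlp.loss_curve_              # loss per epoch from the normalized model above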

predictions = mlp.predict(x_eval)
print("Evaluation Accuracy = ", accuracy_score(y_eval, predictions))
Evaluation Accuracy =  0.9736190476190476

4.1 Confusion Matrix

plot_confusion_matrix(mlp, x_eval, y_eval)

Confusion Matrix

We can observe that our trained model performs well on all the digits, with very few errors. However, the confusion between 9 and 4 is larger than for the other digits: 10 images that are actually '4' were classified as '9', and 7 images that are actually '9' were classified as '4'.
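To read those specific cells of the matrix numerically (a short sketch using the evaluation predictions from above):

# Numeric confusion matrix: rows are true labels, columns are predicted labels
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_eval, predictions)
print("images of '4' predicted as '9':", cm[4, 9])
print("images of '9' predicted as '4':", cm[9, 4])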

 

5. Submission Score

result = mlp.predict(test)
result = pd.DataFrame({'ImageId':range(1, len(result)+1), 'Label':result})
result.to_csv('submission_digit_recog.csv', index=False)

Submitting this output gave me a score of 0.97400 on Kaggle, which is very close to the evaluation accuracy.

 

If you find the article helpful, please share it across your network. Thanks in advance!

 

Credits: