Handwritten Digit Recognition using Machine Learning in Python

5 min readJun 8, 2021

Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents. Think about, for example, the ZIP codes on letters at the post office and the automation needed to recognize these five digits. Perfect recognition of these codes is necessary in order to sort mail automatically and efficiently. Included among the other applications that may come to mind is OCR (Optical Character Recognition) software. OCR software must read handwritten text, or pages of printed books, for general electronic documents in which each character is well defined. But the problem of handwriting recognition goes farther back in time, more precisely to the early 20th Century (1920s), when Emanuel Goldberg (1881–1970) began his studies regarding this issue and suggested that a statistical approach would be an optimal choice.

Hypothesis :

The Digits data set of the Scikit-learn library provides numerous data-sets that are useful for testing many problems of data analysis and prediction of the results. Some Scientist claims that it predicts the digit accurately 95% of the times. Perform data Analysis to accept or reject this Hypothesis.

Prerequists :

Sklearn

Matplotlib

Basics of Machine learning

Dataset :

In this project, we are using the Handwritten Digits dataset which is already ready in the sklearn library. we can import the dataset using the below code.

from sklearn import datasets
digits = datasets.load_digits()

Digits dataset is a dictionary that contains data, targets, images, features names, description of the dataset, target names, etc.

We focus mainly on data and targets. We extract both on different variables.

main_data = digits[‘data’]
targets = digits[‘target’]

Now we can see our data using the following code.

def view_digit(index):
    plt.imshow(digits.images[index] , cmap = plt.cm.gray_r , interpolation = 'nearest')
    plt.title('Orignal it is: '+ str(digits.target[index]))
    plt.show()view_digit(17)

Model Planning

To see how different models work on different data sizes we are using 3 models Support vector Classifier, Decision Tree Classifier, Random Forest Classifier.

Support Vector Classifier :

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the number of features) that distinctly classifies the data points.

More…

Ref:- https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

Code :

# import the SVC
from sklearn import svm
svc = svm.SVC(gamma=0.001 , C = 100.) 
# gamma and C are hyperparameters# Training data = 1790 , Validation data = 6
svc.fit(main_data[:1790] , targets[:1790])# predict on test data
predictions = svc.predict(main_data[1791:])# check the result 
predictions , targets[1791:]

As we can see we use very high data for training and very little data for training and the Support Vector Classifier do a very good job on data and we get 100 % accuracy on test data.

2. Decision Tree Classifier :

Decision Tree Classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem. Decision Tree Classifier poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached.

More Details on Decision Tree…

Ref:- http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/decisionTree.html

Code :

# import the Classifier
from sklearn.tree import DecisionTreeClassifier# Instanciate Model
# we can also use criterion = 'entropy' both lead us to nearly same 
# resultdt = DecisionTreeClassifier(criterion = 'gini') # fit the data on model
# Training Set = 1600 , Validation Set = 197
dt.fit(main_data[:1600] , targets[:1600])# prediction on test data
predictions2 = dt.predict(main_data[1601:])# We use classification materics as accuracy_score
# import accuracy_score
from sklearn.metrics import accuracy_scoreaccuracy_score(targets[1601:] , predictions2)

Now for this time, we use different sizes of training data and validation data. As we can see Decision Tree classifier performs poor performance on the data. we can increase the accuracy by fine-tuning the hyperparameters of DTC.

Hyperparameters of Decision Tree Classifier

3. Random Forest Classifier :

Random forests is a supervised learning algorithm. It can be used both for classification and regression. It is also the most flexible and easy-to-use algorithm. A forest is comprised of trees. It is said that the more trees it has, the more robust a forest is. Random forests create decision trees on randomly selected data samples, get a prediction from each tree, and selects the best solution by means of voting. It also provides a pretty good indicator of the feature’s importance.

arnab132 - Overview

Certified Professional in the field of Computer Science. Member of ASTM International. Love coding and designing. Block…

github.com

Final Note :

Thanks for reading! If you liked this article, please hit the clap👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.

Handwritten Digit Recognition using Machine Learning in Python

arnab132 - Overview

Certified Professional in the field of Computer Science. Member of ASTM International. Love coding and designing. Block…

Written by Arnab Dey