# Handwritten numeral recognition based on KNN algorithm

Posted by tekparasite on Tue, 23 Nov 2021 05:15:06 +0100

## KNN nearest neighbor classification algorithm:

The K-nearest neighbor (kNN) classification algorithm, also called the proximity algorithm, is one of the simplest methods in machine learning classification. "K nearest neighbors" means that each sample can be represented by the K samples closest to it.

KNN is a supervised learning method: it needs training data with category labels. It is also a lazy learner, described as memory-based or instance-based learning, because it has no explicit training phase. Once the program has loaded the data into memory, it can classify new samples directly.

Algorithm implementation:

1. Calculate the distance between the test point and each sample point.
2. Select the K nearest samples and obtain their labels.
3. Find the label that occurs most often among those K samples and return it.
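The three steps above can be sketched on toy data as follows. This is only a minimal illustration, not the code used later in this post: the function name `knn_predict`, the 2-D points and their labels are all made up for the example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    train_points = np.asarray(train_points, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 1: Euclidean distance from the query to every training sample
    distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances, then their labels
    nearest = np.argsort(distances)[:k]
    nearest_labels = [train_labels[i] for i in nearest]
    # Step 3: the label that occurs most often among the k neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (invented for illustration): two clusters labelled 0 and 1
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, [0.5, 0.5], k=3)
print(prediction)
```

Note that nothing is "trained" here: the function only stores nothing and compares the query against the raw samples each time it is called, which is what makes KNN a lazy learner.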

In essence, KNN is a statistical method: the prediction comes from counting labels in the data rather than from a fitted model.

The following is a KNN case study: handwritten numeral recognition. My data here is already in text format, so there are no picture conversion steps. Material model: (source code and material will be posted with a GitHub link at the end)

## KNN handwritten numeral recognition

Implementation idea:

• The test sample is flattened into a 0-1 vector of 1024 elements, and all L training samples are flattened in the same way
• Store the L vectors in a new matrix A, so that each row of matrix A holds all the information of one digit (the code in this post stores one sample per row)
• Calculate the distance between the test sample and each row of matrix A, and store the resulting L distances in the distance array
• Take the K smallest distances from the distance array and look up the training labels at those indices; the label that occurs most often is the predicted value
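A small sketch of this idea, assuming each digit is given as 32 strings of 32 "0"/"1" characters. The helper `text_to_vector` and the two tiny "images" are hypothetical and exist only for this illustration; samples are stacked one per row, as in the code in this post.

```python
import numpy as np

def text_to_vector(lines):
    """Flatten 32 strings of 32 '0'/'1' characters into a 1024-element vector."""
    return np.array([int(ch) for line in lines for ch in line[:32]], dtype=float)

# Two invented "digits": an all-zero image and one whose first row is filled
blank = ["0" * 32] * 32
top_row = ["1" * 32] + ["0" * 32] * 31

# Matrix A: one flattened training sample per row
A = np.vstack([text_to_vector(blank), text_to_vector(top_row)])

# Distance from a test sample to every row of A
test_vec = text_to_vector(top_row)
distances = np.sqrt(((A - test_vec) ** 2).sum(axis=1))
nearest_index = int(np.argmin(distances))
print(A.shape, nearest_index)
```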

Step 1: Import module:

```python
import os, time, operator           # os reads the file names; time measures the running time
import pandas as pd                 # data processing library; install with: pip install pandas
import numpy as np                  # scientific computing library; install with: pip install numpy
import matplotlib.pyplot as plt     # plotting library; install with: pip install matplotlib
```

Step 2: Set the data paths and define a function that reads a file and converts its data

```python
# Data paths
trainingDigits = r'D:\work\Daily task 6 Machine Learning\day2 Handwritten numeral recognition\trainingDigits'
testDigits = r'D:\work\Daily task 6 Machine Learning\day2 Handwritten numeral recognition\testDigits'
tarining = os.listdir(trainingDigits)    # File names of the training set; len(tarining) is 1934
test = os.listdir(testDigits)            # File names of the test set; len(test) is 945

def read_file(doc_name):                 # Converts a 32x32 text file into a single row of 1024 values
    data = np.zeros((1, 1024))           # Create a zero array
    with open(doc_name) as f:            # Open the file (closed automatically afterwards)
        for i in range(32):              # Each file is known to have 32 rows and 32 columns
            hang = f.readline()          # Read one row of the file ("hang" means row)
            for j in range(32):          # Take each column in the row
                data[0, 32*i + j] = int(hang[j])   # Store the 0/1 value
    # print(pd.DataFrame(data))          # Do not convert to a DataFrame here: it makes reading the
    return data                          # test set about 7 times slower and the training set about 12 times slower
```

Step 3: Define a function that converts a dictionary to a list, which will be used later. To improve efficiency, I did not use the pandas DataFrame to operate on the data.

```python
def dict_list(dic: dict):              # Converts a dictionary to a list of (key, value) tuples
    keys = dic.keys()                  # dic.keys() gives the keys of the dictionary
    values = dic.values()              # dic.values() gives the values of the dictionary
    lst = [(key, val) for key, val in zip(keys, values)]   # zip pairs each key with its value
    return lst                         # Returns a list
```
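As an aside, the standard library can do the same bookkeeping. The sketch below is an alternative to the dictionary-plus-`dict_list` approach, not the method used in this post; `most_common_label` and the `votes` list are invented for the example.

```python
from collections import Counter

def most_common_label(neighbor_labels):
    """Return the label that occurs most often among the neighbor labels."""
    return Counter(neighbor_labels).most_common(1)[0][0]

votes = [7, 1, 7]                       # invented labels of the 3 nearest neighbors
winner = most_common_label(votes)

# The same list of (key, value) tuples that dict_list builds from a dictionary
pairs = list(Counter(votes).items())
print(winner, pairs)
```

`Counter` counts the votes in one call, and `dict.items()` already yields the (key, value) pairs, so no helper function is strictly needed.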

Step 4: Define similarity function:

```python
def xiangsidu(tests, xunlians, labels, k):    # tests: test sample; xunlians: training sample set; labels: training labels; k: number of neighbors
    data_hang = xunlians.shape[0]                   # Number of rows (samples) in the training set
    zu = np.tile(tests, (data_hang, 1)) - xunlians  # tile repeats the test sample data_hang times, then subtract row by row
    q = np.sqrt((zu**2).sum(axis=1)).argsort()      # Euclidean distances; argsort returns the indices from nearest to farthest
    my_dict = {}                                    # Maps each label to its number of votes
    for i in range(k):                              # Count the labels of the k nearest samples
        votelabel = labels[q[i]]                    # q[i] is an index; labels gives the corresponding label
        my_dict[votelabel] = my_dict.get(votelabel, 0) + 1   # get() returns the default 0 if the label has not been seen yet
    sortclasscount = sorted(dict_list(my_dict), key=operator.itemgetter(1), reverse=True)   # Sort by vote count, descending
    return sortclasscount[0][0]                     # Return the most frequent label
```
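The `np.tile` call is not strictly necessary, because NumPy broadcasting can subtract a `(1, 1024)` array from an `(L, 1024)` array directly. The sketch below compares the two on random 0/1 data; the array names and sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=(5, 1024)).astype(float)   # 5 invented training samples
query = rng.integers(0, 2, size=(1, 1024)).astype(float)   # 1 invented test sample

# Distances computed with np.tile, as in xiangsidu
tiled = np.sqrt(((np.tile(query, (train.shape[0], 1)) - train) ** 2).sum(axis=1))

# Distances computed with broadcasting: query is stretched across the rows automatically
broadcast = np.sqrt(((query - train) ** 2).sum(axis=1))

same = np.allclose(tiled, broadcast)
order = broadcast.argsort()            # indices from nearest to farthest
print(same, order)
```

Both versions give the same distances; broadcasting simply avoids building the repeated copy of the test sample.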

Step 5: Write recognition function:

```python
def shibie():                                        # Recognizes the handwritten digits in the test set
    label_list = []                                  # Labels of the training samples
    train_length = len(tarining)                     # Get the size of the training set once
    train_zero = np.zeros((train_length, 1024))      # One row of 1024 values per training sample
    for i in range(train_length):                    # Traverse the training set
        doc_name = tarining[i]                       # File name of the i-th training sample
        file_label = int(doc_name.split('_')[0])     # Assumption: names look like "3_12.txt", with the digit before the underscore
        label_list.append(file_label)                # Add the label to label_list
        train_zero[i, :] = read_file(os.path.join(trainingDigits, doc_name))   # 32x32 file converted to 1024 values
    # Here is the test set
    errornum = 0                                     # Number of errors so far
    testnum = len(test)                              # Size of the test set
    errfile = []                                     # File names that were misclassified
    for i in range(testnum):                         # Classify each test sample against the training set with KNN
        testdoc_name = test[i]                       # File name of the i-th test sample
        test_label = int(testdoc_name.split('_')[0])                     # True digit, under the same file name assumption
        testdataor = read_file(os.path.join(testDigits, testdoc_name))   # Read the test sample
        result = xiangsidu(testdataor, train_zero, label_list, 3)        # Predicted digit, with k = 3
        print("Testing %d, The content is %d" % (test_label, result))    # Output the label and the result
        if result != test_label:                     # Compare the prediction with the true label
            errornum += 1                            # If they differ, count one more error
            errfile.append(testdoc_name)             # And record the misclassified file name
    print("Number of errors :%d" % errornum)         # Output the number of errors
    print("Wrong are :%s" % errfile)                 # Output the misclassified file names
    print("Accuracy %.2f%%" % ((1 - (errornum / float(testnum))) * 100))   # Output the accuracy
```
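The error counting at the end of the recognition function can also be written without an explicit loop. This is only a sketch on invented values: the labels, predictions and file names below are hypothetical and are not results from the real data set.

```python
import numpy as np

true_labels = np.array([0, 1, 2, 3, 4])     # invented ground truth
predicted = np.array([0, 1, 2, 8, 4])       # invented predictions, one of them wrong
file_names = ["0_0.txt", "1_0.txt", "2_0.txt", "3_0.txt", "4_0.txt"]   # hypothetical names

wrong = predicted != true_labels            # boolean mask of misclassified samples
error_count = int(wrong.sum())
wrong_files = [name for name, is_wrong in zip(file_names, wrong) if is_wrong]
accuracy = (1 - error_count / len(true_labels)) * 100

print("Number of errors :%d" % error_count)
print("Wrong are :%s" % wrong_files)
print("Accuracy %.2f%%" % accuracy)
```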

Last call:

```python
if __name__ == '__main__':          # Run only when the file is executed as a script
    a = time.time()                 # Set start time
    shibie()                        # Call the recognition function
    b = time.time() - a             # Calculate run time
    print("Running time:", b)       # Output run time
```

There is not much more to say. To improve efficiency I used several tricks along the way, although the code is still a pile of for loops. The comments on each step should be clear enough to follow. If anything is unclear, please leave a message.