Handwritten digit recognition based on the KNN algorithm

Posted by tekparasite on Tue, 23 Nov 2021 05:15:06 +0100

The KNN (k-nearest neighbor) classification algorithm:

The k-nearest neighbor (kNN) classification algorithm is one of the simplest methods in machine-learning classification. "K nearest neighbors" means exactly that: each sample is classified according to the k training samples closest to it.

KNN is supervised learning with category labels, and it is a lazy learner, also called memory-based or instance-based learning: it has no explicit training phase. Once the program loads the data into memory, it can classify directly without training.

Algorithm implementation: 1. Compute the distance between each sample point and the test point. 2. Select the K nearest samples and obtain their labels. 3. Return the label that occurs most often among those K samples.

In essence, KNN is a method based on data statistics.
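The three steps above can be sketched as a minimal, generic KNN classifier. This is not the post's final implementation (that follows below); the function name `knn_predict` and the toy data are illustrative only.

```python
import numpy as np

def knn_predict(test_point, samples, labels, k=3):
    """Classify one point by majority vote among its k nearest samples.

    `samples` is an (n, d) array, `labels` a length-n sequence,
    `test_point` a length-d vector.
    """
    # 1. Euclidean distance from the test point to every sample
    distances = np.sqrt(((samples - test_point) ** 2).sum(axis=1))
    # 2. Indices of the k nearest samples
    nearest = distances.argsort()[:k]
    # 3. Majority label among those k samples
    votes = {}
    for idx in nearest:
        votes[labels[idx]] = votes.get(labels[idx], 0) + 1
    return max(votes, key=votes.get)
```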

The following is a KNN case study: handwritten digit recognition. The samples here are in text format, so there is no image-conversion step. Material model: (a GitHub link to the source code and materials is at the end.)

KNN handwritten digit recognition

Implementation idea:

  • Convert the test data into a single-row 0-1 matrix, and convert all L training files into single-row 0-1 matrices the same way
  • Stack the L single-row vectors into a new matrix A; each row of A stores all the information of one digit
  • Compute the distance between the test data and each row of A; store the L resulting distances in a distance array
  • Take the training-set indices of the K smallest distances; the label that appears most often among them is the prediction
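A hypothetical miniature of this layout, with three "training" row vectors stacked into matrix A, one test row, and the L = 3 distances computed with `np.tile` the same way the similarity function below does (the data here is made up for illustration):

```python
import numpy as np

# Three training samples, each flattened into one row of matrix A
A = np.array([[0., 1., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 0., 1.]])
test_row = np.array([[0., 1., 1., 1.]])

diff = np.tile(test_row, (A.shape[0], 1)) - A   # replicate the test row L times
distances = np.sqrt((diff ** 2).sum(axis=1))    # one distance per training row
nearest = distances.argsort()                   # indices, closest first
```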

Step 1: Import modules:

import os, time, operator           #os to list file names, time to measure efficiency
import pandas as pd                 #Data-processing library pandas (pip install pandas)
import numpy as np                  #Scientific-computing library numpy (pip install numpy)
import matplotlib.pyplot as plt     #Plotting library matplotlib (pip install matplotlib)

Step 2: Define a function that reads a file and converts its data:

trainingDigits = r'D:\work\Daily task 6 Machine Learning\day2 Handwritten numeral recognition\trainingDigits'
testDigits = r'D:\work\Daily task 6 Machine Learning\day2 Handwritten numeral recognition\testDigits'
                                             ## ↑ data paths
tarining = os.listdir(trainingDigits)        ## Read training set file names (1934 files)
test = os.listdir(testDigits)                ## Read test set file names (945 files)
def read_file(doc_name):                     ## Convert a 32x32 text grid to a single row
    data = np.zeros((1, 1024))               ## Create a zero array of shape (1, 1024)
    f = open(doc_name)                       ## Open the file
    for i in range(32):                      ## Each file has 32 rows and 32 columns
        hang = f.readline()                  ## Read one line
        for j in range(32):                  ## Walk each column of the row
            data[0, 32*i+j] = int(hang[j])   ## Store the value
    f.close()                                ## Close the file handle
    # print(pd.DataFrame(data))              ## Do not convert to a DataFrame here:
    return data                              ## it slows reading the test set ~7x
                                             ## and the training set ~12x
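A quick sanity check of the flattening: the code below re-states `read_file` with a context manager, writes a synthetic 32x32 grid of zeros with one '1' planted, and verifies that the bit lands at flat index 32*i + j. The temp-file path and the planted position are made up for the test.

```python
import os, tempfile
import numpy as np

def read_file(doc_name):
    """Same conversion as above: a 32x32 text grid of 0/1 characters
    becomes a single (1, 1024) row vector."""
    data = np.zeros((1, 1024))
    with open(doc_name) as f:
        for i in range(32):
            row = f.readline()
            for j in range(32):
                data[0, 32 * i + j] = int(row[j])
    return data

# Synthetic file: 32 lines of 32 zeros, one '1' planted at row 3, column 7
lines = [['0'] * 32 for _ in range(32)]
lines[3][7] = '1'
path = os.path.join(tempfile.mkdtemp(), '0_test.txt')
with open(path, 'w') as f:
    f.write('\n'.join(''.join(row) for row in lines))

vec = read_file(path)
# The planted bit ends up at flat index 32*3 + 7 = 103
```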

Step 3: Define a function that converts a dictionary to a list, which will be used later. (To improve efficiency, I did not use the pandas DataFrame to manipulate the data.)

def dict_list(dic: dict):              ## Convert a dictionary to a list of (key, value) tuples
    keys = dic.keys()                  ## dic.keys() gives the dictionary's keys
    values = dic.values()              ## dic.values() gives the dictionary's values
    lst = [(key, val) for key, val in zip(keys, values)]  ## zip pairs keys with values
    return lst                         ## Return the list of pairs
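For what it's worth, `dict_list` produces exactly what the built-in `dict.items()` view gives when turned into a list; the sample counts below are invented to show the equivalence and the sort used in Step 4.

```python
import operator

# Invented vote counts, as the similarity function below would produce
counts = {'3': 2, '8': 5, '1': 1}

# The post's comprehension vs. the built-in items() view
lst = [(k, v) for k, v in zip(counts.keys(), counts.values())]

# Sorting (label, count) pairs by count, as Step 4 does
top = sorted(lst, key=operator.itemgetter(1), reverse=True)[0][0]
```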

Step 4: Define similarity function:

def xiangsidu(tests, xunlians, labels, k):   ## tests: test sample # xunlians: training set # labels: labels # k: number of neighbors
    data_hang = xunlians.shape[0]            ## Number of rows in the training set
    zu = np.tile(tests, (data_hang, 1)) - xunlians  ## Tile the test row data_hang times, then subtract the training set
    q = np.sqrt((zu**2).sum(axis=1)).argsort()      ## Euclidean distances, sorted low to high; argsort returns indices
    my_dict = {}                             ## Vote-counting dict
    for i in range(k):                       ## Tally the labels of the k nearest samples
        votelabel = labels[q[i]]             ## q[i] is an index; look up its label
        my_dict[votelabel] = my_dict.get(votelabel, 0) + 1  ## Count each label; get returns 0 if the key is absent
    sortclasscount = sorted(dict_list(my_dict), key=operator.itemgetter(1), reverse=True)
                                             ## Sort (label, count) pairs by count, descending
    return sortclasscount[0][0]              ## Return the most frequent label
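The same logic can be exercised on a tiny synthetic set. The variant below mirrors `xiangsidu` but replaces the manual vote dict with `collections.Counter` — a swapped-in convenience, not the post's code — and the four 2-D "training" points and their labels are made up for the test.

```python
from collections import Counter
import numpy as np

def xiangsidu_counter(tests, xunlians, labels, k):
    """Same steps as xiangsidu above, with Counter doing the vote tally."""
    diff = np.tile(tests, (xunlians.shape[0], 1)) - xunlians  # tile and subtract
    order = np.sqrt((diff ** 2).sum(axis=1)).argsort()        # nearest first
    votes = Counter(labels[order[i]] for i in range(k))       # tally k labels
    return votes.most_common(1)[0][0]                         # majority label

# Synthetic data: two clusters, labeled 1 and 0
train = np.array([[1., 1.], [1., 0.9], [0., 0.], [0.1, 0.]])
labels = [1, 1, 0, 0]
```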

Step 5: Write recognition function:

def shibie():                                   ## Recognize the handwritten digits
    label_list = []                             ## Labels of the training samples
    train_length = len(tarining)                ## Length of the training set, fetched once
    train_zero = np.zeros((train_length, 1024)) ## Zeros array of shape (training-set length, 1024)
    for i in range(train_length):               ## Walk the training set
        doc_name = tarining[i]                  ## File name of the i-th sample
        file_label = int(doc_name[0])           ## The first character of the file name is its label
        label_list.append(file_label)           ## Record the label
        train_zero[i,:] = read_file(r'%s\%s' % (trainingDigits, doc_name)) ## Store its 1024-element row
                                                ## Now the test set
    errornum = 0                                ## Error counter
    testnum = len(test)                         ## Length of the test set, as above
    errfile = []                                ## Misclassified file names
    for i in range(testnum):                    ## Classify each test sample against the training set with KNN
        testdoc_name = test[i]                  ## File name of the i-th test sample
        test_label = int(testdoc_name[0])       ## Its true label, from the file name
        testdataor = read_file(r'%s\%s' % (testDigits, testdoc_name)) ## Convert the test file
        result = xiangsidu(testdataor, train_zero, label_list, 3)  ## Predict with k = 3
        print("Testing %d, The content is %d" % (test_label, result)) ## Print label and prediction
        if result != test_label:                ## Compare prediction with the true label
            errornum += 1                       ## Count the error
            errfile.append(testdoc_name)        ## Record the misclassified file name
    print("Number of errors :%d" % errornum)    ## Report error count
    print("Wrong are :%s" % errfile)            ## Report misclassified file names
    print("Accuracy %.2f%%" % ((1 - (errornum / float(testnum))) * 100)) ## Report accuracy

Finally, call it:

if __name__ == '__main__':        ## Script entry point
    a = time.time()               ## Record start time
    shibie()                      ## Run the recognizer
    b = time.time() - a           ## Compute elapsed time
    print("Running time:", b)     ## Print elapsed time

That's about it. Many of the steps in the middle were tuned for efficiency, even though the whole thing is still a pile of for loops. The comments on each step are quite detailed, so I believe you can follow it; if anything is unclear, please leave a message.

GitHub full link: https://github.com/lixi5338619/KNN_Distinguish/tree/master