K-Nearest Neighbor (Test Version 3)

Posted by tomms on Mon, 03 Jan 2022 04:33:45 +0100

K-Nearest Neighbor Classification Algorithm: A Case Study
My friend Helen has been using an online dating website to find the right person. Although the site keeps recommending candidates, she hasn't found anyone she likes. After some reflection, she realized that the people she has met fall into three types: people she didn't like, people she liked in small doses, and people she liked in large doses. Despite spotting this pattern, Helen still can't sort the site's recommendations into the right categories. She would prefer to date the moderately attractive people on weekdays and save the most attractive ones for the weekend. Helen wants our classification software to help her sort matches into exactly these categories. She has also collected some data that the dating website doesn't record, which she believes will help classify matches better.
1. Data preparation
Helen has been collecting dating data for some time and keeps it in the text file datingTestSet.txt. Each sample occupies one row; there are 1,000 samples in total, each described by three features: 'frequent flyer miles earned per year', 'percentage of time spent playing video games', and 'liters of ice cream consumed per week'.
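Each row of the file holds four tab-separated fields, the three features followed by the class label. The first rows look roughly like this (illustrative values, not necessarily taken from the real file):

40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike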
2. Data Preprocessing
Because the three attributes are measured on very different scales, the data must be normalized before computing distances; otherwise the flight-miles feature would dominate the distance calculation.
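As a minimal sketch of what this step does (the script below uses sklearn's StandardScaler, which applies the same z-score transform column by column):

import numpy as np

def z_score(X):
    # Center each feature at zero mean and scale to unit variance,
    # so that no single feature dominates the distance computation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[40920.0, 8.3, 0.95],
              [14488.0, 7.2, 1.67],
              [26052.0, 1.4, 0.81]])
print(z_score(X))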
3. Data Modeling
Using classification accuracy as the criterion, the optimal parameter k is selected by k-fold cross-validation (the script below uses six folds).
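Conceptually, cross-validation is the loop below (a sketch using sklearn's KFold; the actual script simply calls cross_val_score, which wraps this):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cv_accuracy(X, y, k, n_splits=6):
    # Mean validation accuracy of a k-NN model over n_splits folds.
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X[train_idx], y[train_idx])
        scores.append(knn.score(X[val_idx], y[val_idx]))
    return np.mean(scores)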
4. Model prediction
Now that we've tested the classifier on our data, we can finally use it to classify people for Helen. We'll give Helen a small program: when she finds someone on the dating website, she enters that person's information, and the program predicts how much she will like them (a sketch of this program follows the script below).

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


df=pd.read_csv("datingTestSet.txt",sep="\t",names=['flight','game','ice','type'])  # columns in file order: flight miles, game percentage, ice cream liters, label
features=df.drop(columns=['type'])
targets=df['type']

X_train,X_test,y_train,y_test=train_test_split(features,targets,test_size=0.25,random_state=2022)
# Standardization: fit the scaler on the training set only,
# then apply the same transform to the test set
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

# Combining K-fold cross-validation with learning curve to obtain optimal K-value
scores_cross = []
Ks = []
for k in range(3, 20):
    knn = KNeighborsClassifier(n_neighbors=k)  # instantiate a model with the current k
    score_cross = cross_val_score(knn, X_train, y_train, cv=6).mean()  # mean six-fold cross-validation score on the training data
    scores_cross.append(score_cross)
    Ks.append(k)

# Convert to Array Type
scores_arr = np.array(scores_cross)
Ks_arr = np.array(Ks)
# Draw learning curve
plt.plot(Ks, scores_cross)
plt.show()
# The index of the highest cross-validation score gives the optimal K value
score_best = scores_arr.max()  # highest score
index_best = scores_arr.argmax()  # index of the highest score
Ks_best = Ks_arr[index_best]  # K value at that index
print("Best k:", Ks_best)

# Refit the model with the optimal K value found by cross-validation
knn=KNeighborsClassifier(n_neighbors=Ks_best)
knn.fit(X_train,y_train)
score=knn.score(X_test,y_test)
print("score:",score)

# A new sample must pass through the same scaler before prediction
# (feature order: flight miles, game percentage, ice cream liters)
new_person=scaler.transform([[40000,10,0.5]])
print("Forecast result:",knn.predict(new_person)[0])


Handwriting recognition is a common image-recognition task: the computer recognizes the characters in pictures of handwriting. Unlike printed typefaces, handwriting style and size vary from person to person, which makes handwriting hard for computers to recognize. Handwritten digit recognition is a comparatively simple handwriting task because there are only ten categories (the digits 0~9).
1. Data preparation
The folder "trainingDigits" in the archive "digits.zip" contains 204 txt files, each corresponding to one of the digits 0~9, preprocessed into a 32x32 matrix of 0s and 1s, as shown below. Read these files and convert the data in each file into a 1024-dimensional row vector to form the training set. Note: take "0_2.txt" as an example: the 0 before the underscore is the digit the file represents, i.e. the class label of this data vector is 0, and the 2 is the index of this sample among the files for that digit. Tip: list all the files in a folder with "from os import listdir", convert each file to a vector, and obtain its class label with split("_")[0] applied to the file name.
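The first lines of such a file look like this (an illustrative excerpt; every file holds 32 lines of 32 characters):

00000000000001111000000000000000
00000000000011111110000000000000
00000000001111111111000000000000
...(29 more lines)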
2. Model building
The optimal number of neighbors K is selected by cross-validation.
3. Forecast classification
Read all the data files in the folder "testDigits" of "digits.zip", vectorize them with the same method, and predict their labels with the trained model.
4. Model evaluation
What classification accuracy does the model achieve on the test data? (A per-digit breakdown is sketched after the script below.)

from os import listdir
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

def img2vector(path):
    # Read one 32x32 text matrix of 0s and 1s and flatten it
    # into a 1024-dimensional row vector
    vect = np.zeros(1024)
    with open(path) as fr:
        for i in range(32):
            line = fr.readline()
            for j in range(32):
                vect[32*i+j] = int(line[j])
    return vect

train_names = listdir('digits/trainingDigits')
test_names = listdir('digits/testDigits')
train_data, train_label = [], []
test_data, test_label = [], []
for name in train_names:
    train_label.append(int(name.split("_")[0]))  # class label comes from the file name
    train_data.append(img2vector("./digits/trainingDigits/{}".format(name)))
for name in test_names:
    test_label.append(int(name.split("_")[0]))
    # Bug fix: test vectors are read from testDigits, not trainingDigits
    test_data.append(img2vector("./digits/testDigits/{}".format(name)))

train_data=np.array(train_data)
test_data=np.array(test_data)
# Combining K-fold cross-validation with learning curve to obtain optimal K-value
scores_cross = []
Ks = []
for k in range(3, 20):
    knn = KNeighborsClassifier(n_neighbors=k)  # instantiate a model with the current k
    score_cross = cross_val_score(knn, train_data, train_label, cv=6).mean()  # mean six-fold cross-validation score on the training data
    scores_cross.append(score_cross)
    Ks.append(k)

# Convert to Array Type
scores_arr = np.array(scores_cross)
Ks_arr = np.array(Ks)
# learning curve
plt.plot(Ks, scores_cross)
plt.show()
# The index of the highest cross-validation score gives the optimal K value
score_best = scores_arr.max()  # highest score
index_best = scores_arr.argmax()  # index of the highest score
Ks_best = Ks_arr[index_best]  # K value at that index
print("Best k:", Ks_best)

#model training
knn=KNeighborsClassifier(n_neighbors=Ks_best)
knn.fit(train_data,train_label)
score=knn.score(test_data,test_label)
print("score:",score)

# Score curve: accuracy on the held-out test set for each k
# (note: this is test accuracy, not cross-validated accuracy)
k_range = range(1, 51)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_data,train_label)
    k_scores.append(knn.score(test_data,test_label))

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Test accuracy')
plt.show()


Finally, some image-processing code is attached: our data files are preprocessed txt matrices, but the raw data are PNG pictures, so a PNG must first be converted into the 0-1 text format.

from PIL import Image
import numpy as np
import os

def picTo01(filename):
    # Read the picture
    img = Image.open(filename).convert('RGBA')
    raw_data = img.load()
    # Noise reduction based on the RGB values of the picture:
    # dark pixels become pure black, everything else pure white
    for y in range(img.size[1]):
        for x in range(img.size[0]):
            if raw_data[x, y][0] < 90:
                raw_data[x, y] = (0, 0, 0, 255)
    for y in range(img.size[1]):
        for x in range(img.size[0]):
            if raw_data[x, y][1] < 136:
                raw_data[x, y] = (0, 0, 0, 255)
    for y in range(img.size[1]):
        for x in range(img.size[0]):
            if raw_data[x, y][2] > 0:
                raw_data[x, y] = (255, 255, 255, 255)
    # Scale the picture down to 32x32
    img = img.resize((32, 32), Image.LANCZOS)
    # Save the processed picture for easy comparison
    # (os.path.basename fixes the original split('.')[0], which
    # returned an empty string for paths starting with './')
    base = os.path.splitext(os.path.basename(filename))[0]
    img.save('test_' + base + '.png')
    # Make the black part of the picture 1 and the white part 0 using
    # the luminance formula; a threshold replaces the fragile exact
    # comparison with 255, since resizing interpolates pixel values
    array = np.array(img)
    gray_array = np.zeros((32, 32))
    for x in range(array.shape[0]):
        for y in range(array.shape[1]):
            gray = 0.299*array[x][y][0] + 0.587*array[x][y][1] + 0.114*array[x][y][2]
            if gray > 127:  # white background
                gray_array[x][y] = 0
            else:           # black stroke
                gray_array[x][y] = 1
    # Name the txt file after the digit and save the 0-1 matrix
    np.savetxt(base + '.txt', gray_array, fmt='%d', delimiter='')

picTo01('./png/0_03.png')
# import os.path
# import numpy as np
# #Convert 32*32 to 1*1024
# def img32to1024(filename):
#     returnVect=np.zeros((1,1024))
#     fr=open(filename,'r')
#     for i in range(32):
#         lineStr=fr.readline()
#         for j in range(32):
#             returnVect[0,32*i+j]=int(lineStr[j])
#     return returnVect

# #Collate train data and labels (file name label)
# hwLabels=[]
# trainingFileList=os.listdir('trainingDigits')  # folder of 32x32 txt files, matching the path below
# m=len(trainingFileList)
# trainingMat=np.zeros((m,1024))
# for i in range(m):
#     fileNameStr=trainingFileList[i]
#     fileStr=fileNameStr.split('.')[0]
#     classNumStr=int(fileStr.split('_')[0])
#     hwLabels.append(classNumStr)
#     trainingMat[i,:]=img32to1024('trainingDigits/%s' % fileNameStr)

# import os.path
# import numpy as np
# import train as tr
# def classify(inX,k):
#     dataSetSize=tr.m
#     diffMat=inX-tr.trainingMat
#     sqDiffMat=diffMat**2
#     sqDistances=sqDiffMat.sum(axis=1)
#     distances=sqDistances**0.5
#     #get the rank
#     sortedDistances=distances.argsort()
#     classCount={}
#     for i in range(k):
#         voteIlabel=tr.hwLabels[sortedDistances[i]]
#         classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
#     sortedClassCount=sorted(classCount.items(),key=lambda item:item[1],reverse=True)  # sort by vote count, not by label
#     return sortedClassCount[0][0]
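
For reference, here is a minimal runnable version of the hand-rolled classifier sketched in the commented-out code above (classify0 and its argument names are my own; it implements the same Euclidean-distance majority vote, and assumes train_data/train_label as built earlier):

import numpy as np
from collections import Counter

def classify0(inX, dataSet, labels, k):
    # Euclidean distances from inX to every row of dataSet
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    # indices of the k nearest training vectors
    nearest = distances.argsort()[:k]
    # majority vote among their labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# e.g. classify0(test_data[0], train_data, train_label, 3)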



Topics: Machine Learning Deep Learning Data Mining