KNN Algorithm (with an Iris Classification Implementation)

Posted by nick2005 on Wed, 19 Jan 2022 04:19:54 +0100

1. k-nearest neighbor algorithm

K-Nearest Neighbor (KNN) learning is a commonly used supervised learning method. Its working mechanism is very simple: given a test sample, find the k samples in the training set that are closest to it under some distance measure, and then make a prediction from those k neighbors; the test sample is assigned to the class that appears most often among them, much like "voting". The figure below shows an example of a KNN binary classification problem. The predicted class of the test sample can change with the value of k, but in every case it is decided by the votes of its neighbors.

The steps of the KNN algorithm are given below:

① Calculate the distance between the unknown sample and every known sample.

② Sort the distances from small to large.

③ Select the k nearest known samples.

④ Let these k neighbors vote: the unknown sample is assigned to the class that receives the most votes.

The distance used is generally the Euclidean distance. For two n-dimensional samples x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), it is defined as:

d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
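As a quick sanity check (this snippet is my own addition, not from the original post), the same formula can be evaluated directly with NumPy for two of the samples used later in section 2:

import numpy as np

a = np.array([3, 104])   # a known sample
b = np.array([18, 90])   # the sample to be classified
d = np.sqrt(((a - b) ** 2).sum())
print(d)  # about 20.518, matching the first distance computed in section 2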
The disadvantages of the algorithm are: ① The computational cost is high, because the distance between the sample to be classified and every known sample must be calculated.

② When the sample distribution is imbalanced, for example when one class has far more samples than the others and dominates the neighborhood, a new unknown sample is easily assigned to the dominant class simply because that class contributes so many neighbors, even though the unknown sample is not actually close to it. Take the black unknown point in the figure below as an example: because the purple samples are distributed much more densely than the red samples, the unknown point tends to be classified as purple, although by inspection it should clearly be classified as red.
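One common way to soften this problem (not covered in the original post) is to weight each neighbor's vote by the inverse of its distance, so that a few nearby samples can outweigh many distant ones. The sketch below illustrates the idea; the function name weighted_knn and the small eps term guarding against division by zero are my own choices.

import numpy as np

def weighted_knn(x_test, x_train, y_train, k, eps=1e-8):
    # Euclidean distance from x_test to every training sample
    distance = np.sqrt(((x_train - x_test) ** 2).sum(axis=1))
    nearest = distance.argsort()[:k]   # indices of the k nearest neighbors
    votes = {}
    for idx in nearest:
        label = y_train[idx]
        # Closer neighbors contribute a larger vote
        votes[label] = votes.get(label, 0) + 1.0 / (distance[idx] + eps)
    return max(votes, key=votes.get)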

2. Implementation of KNN algorithm

import matplotlib.pyplot as plt
import numpy as np
import operator
# Known classified data
x1 = np.array([3,2,1])
y1 = np.array([104,100,81])
x2 = np.array([101,99,98])
y2 = np.array([10,5,2])
scatter1=plt.scatter(x1,y1,c='r')
scatter2=plt.scatter(x2,y2,c='b')
# Unknown data
x = np.array([18])
y = np.array([90])
scatter3 = plt.scatter(x,y,c='k')
plt.legend(handles=[scatter1,scatter2,scatter3],labels=['romance','action','none'],loc='best')
plt.show()



x_data=np.array([[3,104],
                   [2,100],
                   [1,81],
                   [101,10],
                   [99,5],
                   [81,2]])
y_data=np.array(['A','A','A','B','B','B'])
x_test=np.array([18,90])
np.tile(x_test,(x_data.shape[0],1))# Repeat x_test so it has the same shape as x_data
array([[18, 90],
       [18, 90],
       [18, 90],
       [18, 90],
       [18, 90],
       [18, 90]])
#Calculate the distance from x_test to each sample in x_data
distance=(((np.tile(x_test,(x_data.shape[0],1))-x_data)**2).sum(axis=1))**0.5
print(distance)
[ 20.51828453  18.86796226  19.23538406 115.27792503 117.41379817
 108.2266141 ]
#Get the indices that sort the distances from small to large
sortdistance=distance.argsort()
sortdistance
array([1, 2, 0, 5, 3, 4], dtype=int64)
classCount = {}
#Set k=5
k=5
for i in range(k):
    label=y_data[sortdistance[i]]# Get the label of the i-th nearest neighbor
    classCount[label]=classCount.get(label,0)+1# Count the votes for each label

classCount
{'A': 3, 'B': 2}
# Sort classCount by its values (operator.itemgetter(1) selects the vote count), in descending order
sortClassCount=sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
sortClassCount
[('A', 3), ('B', 2)]
#Take the label with the most votes
knn=sortClassCount[0][0]
knn# So x_test should belong to category A
'A'

sorted(iterable, key=None, reverse=False)
iterable – the iterable object to sort.
key – a function of one argument that extracts a comparison key from each element of the iterable; the elements are sorted by these key values. (Python 2 also had a cmp parameter, which no longer exists in Python 3.)
reverse – sort order: reverse=True for descending, reverse=False for ascending (the default).
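A small Python 3 example of sorted() with a key function (my own illustration, not part of the original post):

nums = [('a', 3), ('b', 1), ('c', 2)]
print(sorted(nums, key=lambda item: item[1]))                # [('b', 1), ('c', 2), ('a', 3)]
print(sorted(nums, key=lambda item: item[1], reverse=True))  # [('a', 3), ('c', 2), ('b', 1)]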

#operator.itemgetter function
#The itemgetter function provided by the operator module returns a callable that fetches the item at the given index (or indices) from its operand.
a = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
b=operator.itemgetter(1)
b(a)
('jane', 'B', 12)

3. Iris classification with KNN

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import random
from sklearn.metrics import classification_report 
from sklearn.model_selection import train_test_split
import operator
#knn
def knn(x_test,x_train,y_train,k):
    #Calculate the Euclidean distance from x_test to each point in x_train
    distance=(((np.tile(x_test,(x_train.shape[0],1))-x_train)**2).sum(axis=1))**0.5
    sortdistance=distance.argsort()
    classCount={}
    for i in range(k):
        label=y_train[sortdistance[i]]
        classCount[label]=classCount.get(label,0)+1
        #Count the votes for each label
    sortClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    #Sort the labels by vote count in descending order
    #Return the label with the most votes
    return sortClassCount[0][0]
iris=datasets.load_iris()
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.2)
#Split dataset
print(x_test.shape,x_train.shape)
predictions=[]
for i in range(x_test.shape[0]):# Apply knn to each test sample
    predictions.append(knn(x_test[i],x_train,y_train,6))
print(classification_report(y_test,predictions))
(30, 4) (120, 4)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        12
           2       0.88      1.00      0.93         7

    accuracy                           0.97        30
   macro avg       0.96      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

#Implemented with sklearn
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=6)
model.fit(x_train,y_train)
prediction1=model.predict(x_test)
print(classification_report(y_test,prediction1))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        12
           2       0.88      1.00      0.93         7

    accuracy                           0.97        30
   macro avg       0.96      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
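Because the prediction depends on the choice of k, a natural follow-up (not part of the original post) is to compare several values of k with cross-validation on the training split; below is a minimal sketch using sklearn's cross_val_score.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Compare a few candidate values of k using 5-fold cross-validation accuracy
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, x_train, y_train, cv=5)
    print(k, scores.mean())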

Topics: Algorithm Machine Learning