ML k nearest neighbor learning

Posted by jeremyphphaven on Sun, 07 Nov 2021 21:53:55 +0100

k-nearest neighbor learning

Mostly from the watermelon book (Zhou Zhihua's Machine Learning)

1. General

k-nearest neighbor (KNN) learning is a commonly used supervised learning method, and its working mechanism is simple:

Given a test sample, find the k training samples closest to it in the training set according to some distance measure, and then make a prediction based on the information of these k "neighbors". In classification tasks, the "voting method" is usually used: the class label that appears most frequently among the k samples is taken as the prediction. In regression tasks, the "averaging method" can be used: the average of the real-valued labels of the k samples is output as the prediction. Weighted averaging or weighted voting based on distance is also possible: the closer the distance, the greater the weight of the sample.
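As a minimal illustration of these three prediction rules (the neighbor labels, values, and distances below are made-up toy numbers, not from any dataset):

import numpy as np
import pandas as pd

# Toy information about the k = 3 nearest neighbors (hypothetical values)
neighbor_labels = np.array([1, 0, 1])        # class labels (classification)
neighbor_values = np.array([2.5, 3.0, 2.0])  # real-valued labels (regression)
neighbor_dists = np.array([0.5, 1.0, 2.0])   # distances to the test sample

# Voting method: the most frequent label wins
vote = pd.Series(neighbor_labels).value_counts().idxmax()

# Averaging method: mean of the neighbors' real-valued labels
avg = neighbor_values.mean()

# Distance-weighted voting: closer neighbors get larger weights (1 / distance)
weights = 1.0 / neighbor_dists
weighted_vote = pd.Series(weights).groupby(neighbor_labels).sum().idxmax()

print(vote, avg, weighted_vote)  # 1 2.5 1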

KNN learning is a famous representative of "lazy learning": this kind of technique simply stores the samples during the training stage, so the training time cost is zero, and processing is deferred until a test sample is received. In contrast, methods that process the training samples during the training stage are called "eager learning".

Obviously, k is an important hyperparameter: when k takes different values, the classification results can differ significantly. On the other hand, if different distance calculation methods are used, the "nearest neighbors" found may be significantly different, which will also lead to significantly different classification results.

2. Distance measurement

The following is taken from https://blog.csdn.net/weixin_45884316/article/details/115221211

The distance between two instance points in feature space reflects their similarity, so an important problem in the k-nearest neighbor algorithm is how to calculate the distance between samples, in order to determine which training samples are closest to the test sample.

2.1 Euclidean distance
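For two n-dimensional samples x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the Euclidean distance is the straight-line distance between them:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$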

2.2 Manhattan distance
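The Manhattan distance instead sums the absolute differences along each coordinate:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$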

2.3 Others

When the existing distance measures cannot meet the requirements, we can explore distance measures that do.

In real data, discrete and continuous variables often coexist, and how to compute sample similarity with such mixed variables is an open problem. A simple approach is to one-hot encode the discrete variables in the samples before the distance calculation, and then apply one of the distance measures introduced above, as in the sketch below.
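A minimal sketch of this idea (the column names and values are made up purely for illustration):

import numpy as np
import pandas as pd

# Toy data mixing a continuous feature and a discrete feature (hypothetical values)
df = pd.DataFrame({
    'height': [1.70, 1.85, 1.60],
    'color': ['red', 'blue', 'red'],  # discrete variable
})

# One-hot encode the discrete column; the continuous column is kept as-is
encoded = pd.get_dummies(df, columns=['color']).astype(float)

# Euclidean distance between sample 0 and sample 1 in the encoded space
diff = encoded.iloc[0].values - encoded.iloc[1].values
print(np.sqrt(np.sum(diff ** 2)))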

3. Simple implementation of KNN

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Handwritten KNN algorithm (classification task)
class KNN():
    def __init__(self, k=5):
        self.k = k

    def fit(self, xtrain, ytrain):  # Lazy learning, save training data
        self.xtrain = xtrain
        self.ytrain = ytrain

    # Construct distance calculation function
    def cal_distance(self, x):
        # x is a sample
        # Euclidean distance 
        distance = np.sqrt(np.sum(np.power((np.mat(self.xtrain) - np.mat(x)), 2), axis=1))  # Broadcasting mechanism
        return distance

    # Vote among the k nearest neighbors
    def k_vote(self, distance):
        index = np.argsort(distance, axis=0)[:self.k]
        labels = self.ytrain[index]
        labels = [i[0] for i in labels]  # Flatten the nested arrays into a plain list for easier processing
        label = pd.Series(labels).value_counts().idxmax()
        return label

    # Construct prediction function
    def predict(self, xtest):
        y_pred = []
        for i in xtest:
            distance = self.cal_distance(i)
            label = self.k_vote(distance)
            y_pred.append(label)
        return y_pred

    # Construct scoring function
    def score(self, xtest, ytest):
        y_pred = self.predict(xtest)
        from sklearn.metrics import accuracy_score
        return accuracy_score(ytest, y_pred)


if __name__ == '__main__':
    # Read dataset
    data = load_breast_cancer()
    # Organize the data into a DataFrame for display
    X = data.data
    y = data.target
    name = ['Average radius', 'Average texture', 'Average perimeter', 'Average area',
            'Average smoothness', 'Average compactness', 'Average concavity',
            'Average concave point', 'Average symmetry', 'Average fractal dimension',
            'Radius error', 'Texture error', 'Perimeter error', 'Area error',
            'Smoothness error', 'Compactness error', 'Concavity error',
            'Concave error', 'Symmetry error',
            'Fractal dimension error', 'Worst radius', 'Worst texture',
            'Worst boundary', 'Worst area', 'Worst smoothness',
            'Worst compactness', 'Worst depression', 'Worst concave point',
            'Worst symmetry', 'Worst fractal dimension', 'Sick or not']
    data = np.concatenate((X, y.reshape(-1, 1)), axis=1)
    table = pd.DataFrame(data=data, columns=name)

    X = table.iloc[:, :-1]  # features
    y = table.iloc[:, -1]  # label
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

    clf = KNN()
    clf.fit(np.array(Xtrain), np.array(Ytrain))
    acc = clf.score(np.array(Xtest), np.array(Ytest))
    print(acc)
    
    """
    0.9181286549707602
    """

4. Implementation with the sklearn library

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


# Read dataset
data = load_breast_cancer()
# Organize the data into a DataFrame for display
X = data.data
y = data.target
name = ['Average radius','Average texture','Average perimeter','Average area',
    'Average smoothness','Average compactness','Average concavity',
    'Average concave point','Average symmetry','Average fractal dimension',
    'Radius error','Texture error','Perimeter error','Area error',
    'Smoothness error','Compactness error','Concavity error',
    'Concave error','Symmetry error',
    'Fractal dimension error','Worst radius','Worst texture',
    'Worst boundary','Worst area','Worst smoothness',
    'Worst compactness','Worst depression','Worst concave point',
    'Worst symmetry','Worst fractal dimension','Sick or not']
data=np.concatenate((X,y.reshape(-1,1)),axis=1)
table=pd.DataFrame(data=data,columns=name)
X = table.iloc[:,:-1] # features
y = table.iloc[:,-1] # label

# Split into training and test sets (Xtrain, Xtest, Ytrain, Ytest)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=420)  # random_state: random seed for reproducible sampling

# Modeling
clf = KNeighborsClassifier(n_neighbors=5)
clf = clf.fit(Xtrain,Ytrain)

score = clf.score(Xtest,Ytest)
print(score)
'''
0.9385964912280702
'''
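The distance measure discussed in Section 2 can also be changed through the metric parameter of KNeighborsClassifier; a small sketch, reusing the Xtrain/Ytrain split defined above:

# Same model, but with Manhattan (L1) distance instead of the default Euclidean
clf_l1 = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
clf_l1 = clf_l1.fit(Xtrain, Ytrain)
print(clf_l1.score(Xtest, Ytest))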

5. Draw the learning curve to determine the optimal hyperparameter k

import matplotlib.pyplot as plt
score = [] # Score list

k = range(1,20) # Set k range

# Traverse the k values and compute the test score for each
for i in k:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf = clf.fit(Xtrain,Ytrain)

    score.append(clf.score(Xtest,Ytest))
    
plt.plot(k, score)
_ = plt.xticks(range(1, 20))  # x-axis ticks
plt.show()

These are just my study notes; if there is any infringement, please contact me and it will be deleted.

Topics: Algorithm