kNN (k-nearest neighbor algorithm) -- Python

Posted by marcusb on Sun, 30 Jan 2022 07:06:35 +0100

Contents

1. Basic definitions

2. Algorithm principle

2.1 advantages and disadvantages of the algorithm

2.2 algorithm parameters

2.3 variants

3. Distance formula in the algorithm

4. Case realization

4.1 import related libraries

4.2 reading data

4.3 read variable name

4.4 define X,Y data

4.5 separate training set and test set

4.6 calculation of Euclidean distance

4.7 visualize the distance matrix

4.8 predict the samples

4.9 check the accuracy

4.10 cross validation

5. Implementation with scikit-learn

5.1 re-implementation of the above

5.2 another implementation method

1. Basic definitions

The k-nearest neighbor algorithm is a relatively simple machine learning algorithm. It classifies samples by measuring the distance between feature vectors. The idea is straightforward: if most of a sample's nearest (most similar) neighbors in the feature space belong to a certain category, then the sample also belongs to that category. The letter k, usually written in lowercase, denotes the externally specified number of nearest neighbors.

In short, the machine groups points by distance: points that are close to one another are assigned to the same category.

2. Algorithm principle

The core idea of the kNN algorithm is that the category of an unlabeled sample is decided by its k nearest neighbors.
Specifically, suppose we have a labeled data set and an unlabeled sample whose category we want to predict. kNN computes the distance between the unlabeled sample and every sample in the data set, takes the k nearest samples, and assigns the unlabeled sample the category chosen by a majority vote among those k neighbors.
Suppose X_test is the sample to be labeled and X_train is the labeled data set. The pseudocode of the algorithm is as follows (a minimal code sketch is given after the list):

  1. Traverse all samples in X_train, compute the distance between each sample and X_test, and save the distances in the Distance array.
  2. Sort the Distance array, take the nearest k points, and record them as X_knn.
  3. Count the number of samples of each category in X_knn, i.e. how many samples in X_knn belong to class0, how many belong to class1, and so on.
  4. The category of the sample to be labeled is the category with the largest number of samples in X_knn.
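
Below is a minimal sketch of these steps in Python. It is an illustration rather than part of the original post: the function name knn_predict_one and the default k=5 are assumptions, and X_train, y_train and x_test are taken to be NumPy arrays.

import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_test, k=5):
    # Step 1: Euclidean distance from x_test to every training sample
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest training samples
    nearest_idx = np.argsort(distances)[:k]
    # Steps 3-4: count the labels of the k nearest samples and take the majority
    votes = Counter(y_train[nearest_idx])
    return votes.most_common(1)[0][0]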

2.1 advantages and disadvantages of the algorithm

  • Advantages: high accuracy; tolerant of outliers and noise.
  • Disadvantages: computationally expensive and memory-hungry.

2.2 algorithm parameters

The algorithm's parameter is k, and its value needs to be chosen according to the data (see the sketch after this list).

  • The larger the value of k, the higher the bias of the model and the less sensitive it is to noisy data; a k that is too large may cause underfitting.
  • The smaller the value of k, the higher the variance of the model; a k that is too small will cause overfitting.
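
One practical way to choose k, sketched below purely as an illustration (not from the original post), is scikit-learn's GridSearchCV, which evaluates candidate k values with cross validation; it assumes a feature matrix X and label vector y are already loaded.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try k = 1..20 with 5-fold cross validation (illustrative values)
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)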

2.3 variants

There are several variants of the kNN algorithm. One of them weights the neighbors: by default every neighbor gets the same weight in the vote, but you can assign different weights to different neighbors, for example giving closer neighbors a higher weight. This is done through the weights parameter of the algorithm.
Another variant replaces the nearest k points with all points within a certain radius, which can perform better when the data are sampled unevenly. In scikit-learn, the RadiusNeighborsClassifier class implements this variant. A brief sketch of both variants is given below.
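
A minimal sketch of the two variants with scikit-learn (the value n_neighbors=5 is an arbitrary illustration; radius=500.0 matches the value used later in this post and is data-dependent):

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Weighted kNN: closer neighbors get a larger vote (weights='distance')
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Radius-based variant: vote among all training points within a fixed radius
radius_knn = RadiusNeighborsClassifier(radius=500.0)

# Both follow the usual scikit-learn interface, e.g.:
# weighted_knn.fit(X_train, y_train); y_pred = weighted_knn.predict(X_test)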

3. Distance formula in the algorithm

Unlike linear regression, there is no formula to derive here. The core of the kNN classification algorithm is computing distances and then classifying according to those distances.

Start with the two-dimensional Cartesian coordinate system (also known as the rectangular coordinate system), which should be familiar from junior high school. The Euclidean distance is the usual way to measure the distance between two points in it. For point A(2,3) and point B(5,6), the distance AB is

$$d(A,B) = \sqrt{(5-2)^2 + (6-3)^2} = \sqrt{18} = 3\sqrt{2}$$

This is the Euclidean distance. It differs slightly from the two-dimensional case we usually meet in that it also works for higher-dimensional data, i.e. whole matrices of samples, which is exactly what we need here. For n-dimensional points $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$ the formula becomes

$$d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$
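
As a quick check (an illustration added here, not part of the original post), the same distance can be computed with NumPy, either for two single points or for whole sample matrices at once:

import numpy as np

A = np.array([2, 3])
B = np.array([5, 6])
# Distance between two points: sqrt(3^2 + 3^2) = 3 * sqrt(2) ≈ 4.243
print(np.sqrt(((A - B) ** 2).sum()))

# Distance matrix between every row of X1 and every row of X2 (illustrative data)
X1 = np.array([[2, 3], [0, 0]])
X2 = np.array([[5, 6], [1, 1]])
dists = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))
print(dists)  # shape (2, 2)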

4. Case realization

We use the kNN algorithm and its variants to predict diabetes in the Pima Indians dataset. The dataset can be downloaded from the link below.
Link: LAN Zuoyun

4.1 import related libraries

# Import related modules
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import pandas as pd

4.2 reading data

# Read data (a raw string keeps the backslashes in the Windows path intact)
data = pd.read_excel(r'D:\desktop\knn.xlsx')
print(data)

return:

4.3 read variable name

label_need=data.keys()
print(label_need)

return:

4.4 define X,Y data

X = data[label_need].values[:,0:8]
y = data[label_need].values[:,8]
print(X)
print(y)

return:

4.5 separate training set and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2)

# Print training set and test set size
print('X_train=', X_train.shape)
print('X_test=', X_test.shape)
print('y_train=', y_train.shape)
print('y_test=', y_test.shape)

return:

 

4.6 calculation of Euclidean distance

# Define a function that computes the Euclidean distance matrix
# between the test samples and the training samples
def compute_distances(X_test, X_train):
    # Test sample size
    num_test = X_test.shape[0]
    # Training sample size
    num_train = X_train.shape[0]
    # Initialize the distance matrix with shape (num_test, num_train)
    dists = np.zeros((num_test, num_train))
    # Matrix dot product of test samples and training samples (cross term)
    M = np.dot(X_test, X_train.T)
    # Row-wise squared sums of the test sample matrix
    te = np.square(X_test).sum(axis=1)
    # Row-wise squared sums of the training sample matrix
    tr = np.square(X_train).sum(axis=1)
    # Expand (a - b)^2 = a^2 - 2ab + b^2 to get the Euclidean distances
    dists = np.sqrt(-2 * M + tr + te.reshape(-1, 1))
    return dists

dists = compute_distances(X_test, X_train)
print(dists)

return:

4.7 visualize the distance matrix

dists = compute_distances(X_test, X_train)
plt.imshow(dists, interpolation='none')
plt.show()

return:

4.8 predict the samples

# Define a function that predicts each test sample by voting among its k nearest neighbors
def predict_labels(y_train, dists, k=1):
    # Test sample size
    num_test = dists.shape[0]
    # Initialize test set prediction results
    y_pred = np.zeros(num_test)
    # Traverse the test samples
    for i in range(num_test):
        # Initialize nearest neighbor list
        closest_y = []
        # Sort the i-th row of the distance matrix, reorder the training labels
        # by that order, and flatten the result
        # Note the usage of np.argsort
        labels = y_train[np.argsort(dists[i, :])].flatten()
        # Take the nearest k labels
        closest_y = labels[0:k]
        # Count the nearest k labels
        # Note the usage of Counter from the collections module
        c = Counter(closest_y)
        # Take the category with the highest count
        y_pred[i] = c.most_common(1)[0][0]
    return y_pred

# Predict the test set with k = 5 (the value of k is ours to choose)
y_test_pred = predict_labels(y_train, dists, k=5)
print(y_test_pred)

return:

4.9 check the accuracy

Count how many of the predictions match the actual labels:

# Find examples of correct predictions
num_correct = np.sum(y_test_pred == y_test)
print(num_correct)

return:

Calculation accuracy:

# Calculation accuracy
accuracy = float(num_correct) / X_test.shape[0]
print('Got %d/%d correct=>accuracy:%f'% (num_correct, X_test.shape[0], accuracy))

return:

4.10 cross validation

# Fold cross validation
num_folds = 5
# Candidate k value
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
# Training data division
X_train_folds = np.array_split(X_train, num_folds)
# Training label Division
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
# Traverse all candidate k values
for k in k_choices:
    # Five fold traversal    
    for fold in range(num_folds): 
        # Hold out one fold of the training data as the validation set
        validation_X_test = X_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])       
        # Calculate distance
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k=k)
        # View classification accuracy (both arrays are one-dimensional)
        num_correct = np.sum(temp_y_test_pred == validation_y_test)
        num_test = validation_X_test.shape[0]
        accuracy = float(num_correct) / num_test
        k_to_accuracies[k] = k_to_accuracies.get(k,[]) + [accuracy]

Print the classification accuracy for different k values and different folds:

# Print the classification accuracy for different k values and different folds
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

return:

Visualization of the classification accuracy for different k values and different folds:

for k in k_choices:
    # Take out the classification accuracies for this value of k
    accuracies = k_to_accuracies[k]
    # Plot scatter plots with different accuracy of k value
    plt.scatter([k] * len(accuracies), accuracies)
# Calculate the mean value of accuracy and sort
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
# Calculate the standard deviation of accuracy and sort
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
# Draw error bar chart with confidence interval
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
# Drawing title
plt.title('Cross-validation on k')
# x-axis label
plt.xlabel('k')
# y-axis label
plt.ylabel('Cross-validation accuracy')
plt.show()

return:

5. Implementation with scikit-learn

5.1 re-implementation of the above:

# Import the KNeighborsClassifier class
from sklearn.neighbors import KNeighborsClassifier
# Create k nearest neighbor instance
neigh = KNeighborsClassifier(n_neighbors=10)
# k-nearest neighbor model fitting
neigh.fit(X_train, y_train)
# k-nearest neighbor model prediction
y_pred = neigh.predict(X_test)
# # Reconstruction of prediction result array
# y_pred = y_pred.reshape((-1, 1))
# Count the number of correct predictions
num_correct = np.sum(y_pred == y_test)
print(num_correct)
# Calculation accuracy
accuracy = float(num_correct) / X_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, X_test.shape[0], accuracy))

return:

5.2 another implementation method

5.2.1 loading data

import pandas as pd
data = pd.read_csv(r'D:\desktop\knn.csv')
print('dataset shape {}'.format(data.shape))
data.info()

return:

5.2.2 separate training set and test set

X = data.iloc[:, 0:8]
Y = data.iloc[:, 8]
print('shape of X {}, shape of Y {}'.format(X.shape, Y.shape))

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train,Y_test = train_test_split(X, Y, test_size=0.2)

return:

5.2.3 model comparison

We use the ordinary kNN algorithm, the weighted kNN algorithm, and the fixed-radius kNN algorithm to score the data set:

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Build three models
models = []
models.append(('KNN', KNeighborsClassifier(n_neighbors=2)))
models.append(('KNN with weights', KNeighborsClassifier(n_neighbors=2, weights='distance')))
models.append(('Radius Neighbors', RadiusNeighborsClassifier(radius=500.0)))  # n_neighbors does not apply to the radius variant

# Train three models respectively and calculate the score
results = []
for name, model in models:
    model.fit(X_train, Y_train)
    results.append((name, model.score(X_test, Y_test)))
for i in range(len(results)):
    print('name: {}; score: {}'.format(results[i][0], results[i][1]))

return:

For the weighted algorithm we let closer neighbors carry higher weights, and the radius of the RadiusNeighborsClassifier model is 500. From the output we can see that the ordinary kNN algorithm still performs best.

Here comes the question: is this judgment reliable? The answer is no.

Because our training set and test set are randomly assigned, different combinations of training samples and test samples may lead to differences in the accuracy of the calculated algorithm.

So how do we solve this?

We can randomly split the data into training and validation sets many times and then average the model scores.

scikit-learn provides the KFold class and the cross_val_score() function to handle this problem.

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

results = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_result = cross_val_score(model, X, Y, cv=kfold)
    results.append((name, cv_result))
    
for i in range(len(results)):
    print('name: {}; cross_val_score: {}'.format(results[i][0], results[i][1].mean()))

return:

In the above code, KFold divides the data set into 10 parts: one part is used as the validation set to compute the model accuracy, and the remaining 9 parts are used as the training set. The cross_val_score() function computes the model score for each of the 10 different training/validation combinations, and we then take the mean. Again, the ordinary kNN algorithm seems to perform best.

5.2.4 model training and analysis

According to the conclusion obtained from the above model comparison, we then use the ordinary knn algorithm model to train the data set, and check the fitting of the training samples and the prediction accuracy of the test samples:

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, Y_train)
train_score = knn.score(X_train, Y_train)
test_score = knn.score(X_test, Y_test)
print('train score: {}; test score : {}'.format(train_score, test_score))

return:

From here, we can see two problems.

  • The fit on the training samples is poor: the score is only around 0.84, indicating that the model is too simple to fit the training samples well.
  • The prediction accuracy of the model is also not good, at only about 0.66.

Let's draw a curve and have a look.

Let's first define the drawing function. The code is as follows:

from sklearn.model_selection import learning_curve
import numpy as np

def plot_learning_curve(plt, estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o--', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

Then we call this function and draw the following figure:

from sklearn.model_selection import ShuffleSplit

knn = KNeighborsClassifier(n_neighbors=2)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plt.figure(figsize=(10,6), dpi=200)
plot_learning_curve(plt, knn, 'Learn Curve for KNN Diabetes', X, Y, ylim=(0.0, 1.01), cv=cv)

return:

Topics: Python Algorithm Machine Learning Mathematical Modeling