k-nearest neighbors algorithm (predicting Facebook check-in locations)

Posted by drucifer on Mon, 10 Jan 2022 16:50:02 +0100

1. Examples:

(1) Suppose I want to work out where I am. I can ask five people where they are and estimate the distance between each of them and me.

In other words, you infer your own location from your nearest neighbors.

(2) Seven films are given, each labeled as a romance or an action film. To decide which type an unlabeled film belongs to, we compute the distance between it and each of the other films.

The distances from the unknown film to the labeled films tell us what kind of film it is, so the key question is how to compute the distance.

2. Definition:

If most of the k most similar (i.e. nearest) samples in the feature space belong to one category, the sample is assigned to that category as well.

Distance formula: Euclidean distance, i.e. the square root of the sum of the squared differences between feature values.

Samples that are similar should have similar feature values.
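The definition can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name knn_predict and the film data (counts of fight scenes and kiss scenes) are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance: square root of the sum of squared feature differences.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points closest to the query.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # The most common label among those neighbors wins.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical film data: (number of fight scenes, number of kiss scenes)
films = [(3, 104), (2, 100), (1, 81), (101, 10), (99, 5), (98, 2)]
labels = ["romance", "romance", "romance", "action", "action", "action"]

print(knn_predict(films, labels, (18, 90), k=3))
```

The unknown film at (18, 90) sits much closer to the three romance films than to the action films, so the vote among its three nearest neighbors returns "romance".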

Note: the features must be standardized before applying k-nearest neighbors; otherwise a feature with a large numeric range dominates the distance and has too much influence on the result.
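To see why the note above matters, here is a small sketch with two made-up features on very different scales (income and age are hypothetical, not part of the check-in dataset). It rescales each feature to zero mean and unit standard deviation, which is the same transformation StandardScaler applies.

```python
import math
import statistics


def standardize(column):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(v - mean) / std for v in column]


# Hypothetical samples: income in dollars, age in years
income = [30000, 31000, 90000]
age = [25, 60, 26]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

# Distance from sample 0 to the other two samples, before and after scaling.
print([round(math.dist(raw[0], p), 1) for p in raw[1:]])
print([round(math.dist(scaled[0], p), 2) for p in scaled[1:]])
```

Before scaling, the distance is decided almost entirely by income, so the 35-year age gap between samples 0 and 1 barely registers. After scaling, both features contribute on a comparable footing.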


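3. API: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')

n_neighbors: the number of neighbors used for the vote (default 5); this choice affects the result.

algorithm: the method used to search for the nearest neighbors ('auto', 'ball_tree', 'kd_tree' or 'brute'); the default 'auto' picks a suitable method based on the training data.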
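A minimal usage sketch of this API on a toy dataset. The six two-feature samples and their labels are invented for illustration; only the KNeighborsClassifier calls come from scikit-learn.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features per sample, two classes (0 and 1).
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, algorithm="auto")
knn.fit(X, y)

# Predict the class of two new points.
print(knn.predict([[1.1, 1.0], [5.0, 5.0]]))

# Distances to, and indices of, the 3 nearest training samples for one query.
distances, indices = knn.kneighbors([[1.1, 1.0]])
print(indices)
```

kneighbors is handy for checking which training samples actually drove a given prediction.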

4. k-nearest neighbors example: predicting where a person will check in

(1) Example: predicting the check-in location [classification problem]

Goal: predict the place where a person will check in.

Features: x and y coordinates, positioning accuracy, time

Target value: place_id (ID of the check-in location)

(2) Data processing:

Because the dataset is large, restrict x and y to a small range to save time.

Convert the timestamp into date and time components (month, day, weekday, hour, minute, second) and use them as new features.

There are thousands to tens of thousands of locations; those with fewer than a specified number of check-ins are deleted.

(3) Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def knncls():
    """Predict user check-in locations with k-nearest neighbors."""
    # Read the data
    data = pd.read_csv('./data/train.csv')
    # Narrow the data to a small x/y window to save time;
    # .copy() avoids SettingWithCopyWarning when new columns are added below
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()
    # Convert the timestamp (in seconds) to datetime values
    time_value = pd.to_datetime(data['time'], unit='s')
    # Wrap in a DatetimeIndex so day, hour and weekday can be read as attributes
    time_value = pd.DatetimeIndex(time_value)
    # Construct some features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday

    # Delete timestamp feature
    data = data.drop(['time'], axis=1)

    # Drop places with 3 or fewer check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Extract the feature values and the target value
    y = data['place_id']
    # row_id is only a record identifier, so it is dropped along with the target
    x = data.drop(['place_id', 'row_id'], axis=1)

    # Split the data into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization)
    std = StandardScaler()
    # Fit the scaler on the training set, then apply the same scaling to the test set
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # Run the k-nearest neighbors algorithm
    knn = KNeighborsClassifier(n_neighbors=5)
    # Fit the model on the training data
    knn.fit(x_train, y_train)
    # Get the prediction results
    y_predict = knn.predict(x_test)
    print("The predicted target check-in location is:", y_predict)
    # Get accuracy
    print("Prediction accuracy:", knn.score(x_test, y_test))

    return None


if __name__ == '__main__':
    knncls()

5. Summary:


How should k be chosen, and what effect does it have?

If k is very small: the prediction is easily affected by outliers.

If k is very large: the vote is dominated by the categories that have the most samples, so the prediction is swayed by class proportions rather than by the points that are actually nearby.
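Both effects can be reproduced on a tiny one-dimensional dataset. The data below is invented for illustration: class "A" has five samples near 0, class "B" has nine samples near 10, and one extra "B" sample at 0.4 plays the role of an outlier.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


# Hypothetical data: five "A" samples, one "B" outlier at 0.4, nine "B" samples near 10.
points = [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,), (0.4,)]
points += [(10.0 + 0.1 * i,) for i in range(9)]
labels = ["A"] * 5 + ["B"] + ["B"] * 9

query = (0.35,)
for k in (1, 5, 15):
    print(k, knn_predict(points, labels, query, k))
```

With k=1 the single nearest point is the outlier, so the query is labeled "B". With k=5 the four surrounding "A" samples outvote it. With k=15 every sample votes, and the larger class "B" wins even though the query sits in the middle of the "A" cluster.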

Performance issues?

High time complexity: in the brute-force case, each prediction computes the distance to every training sample.

Advantages: simple and easy to understand, no model parameters to estimate (k itself is a setting you choose), no training phase, no iteration

Disadvantages: computation and memory overhead at prediction time are large; k must be specified in advance, and with a poorly chosen k the classification accuracy cannot be guaranteed

Usage scenario: small datasets, from thousands to tens of thousands of samples
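Since a poorly chosen k hurts accuracy, one common way to pick it is to compare a few candidates with cross-validation. The sketch below is an assumption on my part rather than something from the check-in example: it uses scikit-learn's built-in iris dataset as a stand-in and an arbitrary list of candidate k values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (not the check-in data): 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

scores = {}
for k in (1, 3, 5, 7, 9, 11):  # candidate values, chosen arbitrarily
    # Standardize inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 cross-validation folds.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The k with the highest mean accuracy is a reasonable starting point; on a different dataset the best value will generally differ.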

Topics: Machine Learning sklearn facebook