1. For example:
(1) Locating myself: if I don't know where I am, I can ask five nearby people and judge my position from my distance to each of them. In short: judge your position by your nearest neighbors.
(2) Classifying films: given seven films already labeled as romance or action, plus the distance between an unlabeled film and each of them, we can judge which type the unknown film belongs to.
Since the classification is decided by the distances to the known samples, the key question becomes: how do we measure distance?
2. Definition:
If most of the k most similar samples (i.e. the nearest neighbors) of a sample in feature space belong to one category, then the sample also belongs to that category.
Distance formula: Euclidean distance, the square root of the sum of squared feature differences, e.g. d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...)
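The Euclidean distance above, and the majority vote among the k nearest samples, can be sketched in a few lines of NumPy (a toy illustration of the idea, not the sklearn implementation; the film coordinates are made up):

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared feature differences
    return np.sqrt(np.sum((a - b) ** 2, axis=-1))

def knn_predict(X_train, y_train, x, k=3):
    # distances from x to every training sample
    d = euclidean(X_train, x)
    # labels of the k nearest samples
    nearest = y_train[np.argsort(d)[:k]]
    # majority vote among the k labels
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# toy "films": two features each, labeled romance or action
X = np.array([[1.0, 1.1], [1.0, 1.0], [4.0, 4.0], [4.2, 4.1]])
y = np.array(["romance", "romance", "action", "action"])
print(knn_predict(X, y, np.array([1.2, 1.0])))  # → romance
```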
For similar samples, the feature values should also be similar.
Note: features must be standardized before running k-nearest neighbors, so that no single feature dominates the distance just because of its scale.
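A quick sketch of why standardization matters: with features on very different scales, Euclidean distance is dominated by the large-scale feature, while StandardScaler brings each feature to mean 0 and standard deviation 1 (the income/age numbers are made-up example data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales: income and age
X = np.array([[72000.0, 25.0],
              [71000.0, 60.0],
              [30000.0, 26.0]])

# raw distance: the income column swamps the age column entirely
print(np.sqrt(np.sum((X[0] - X[1]) ** 2)))  # ≈ 1000, age barely matters

std = StandardScaler()
X_scaled = std.fit_transform(X)
print(X_scaled.mean(axis=0))  # ≈ [0, 0]
print(X_scaled.std(axis=0))   # ≈ [1, 1]
```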
3. API
API: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
n_neighbors=5: number of neighbors to use (default 5; this value affects the result)
algorithm='auto': the neighbor-search algorithm; 'auto' lets sklearn pick the most appropriate one
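A minimal runnable sketch of this API, using sklearn's bundled iris dataset (chosen here only so the example is self-contained; `random_state=1` is an arbitrary seed):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load a small bundled dataset and split off a test set
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# n_neighbors=5 is the default; algorithm defaults to 'auto'
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
print(knn.score(x_test, y_test))
```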
4. k-nearest neighbor example: predict where a person will check in
(1) Example: predicting the check-in location (a classification problem)
Requirement: predict which place a person will check in to.
Features: x, y coordinates, positioning accuracy, time
Target: place_id (ID of the check-in location)
(2) Preprocessing:
Because the dataset is large, restrict the x, y range to save time.
Convert the timestamp into new features (day, hour, weekday, etc.).
Delete the places (thousands to tens of thousands of them) whose check-in count is below a threshold.
(3) Code
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knncls():
    """K-nearest neighbor: predict a user's check-in location."""
    # 1. Read the data and narrow it down
    data = pd.read_csv('./data/train.csv')
    print(data.head(10))

    # Keep only a small square of the map to save time
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")

    # Process the time data
    time_value = pd.to_datetime(data['time'], unit='s')
    print(time_value)

    # Convert to a DatetimeIndex to extract date components
    time_value = pd.DatetimeIndex(time_value)

    # Construct new time features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday

    # Drop the raw timestamp feature
    data = data.drop(['time'], axis=1)
    print(data)

    # Delete target places whose check-in count is too small
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Split the data into features and target
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)

    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering: standardization
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm: fit, predict, score
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(x_train, y_train)

    y_predict = knn.predict(x_test)
    print("Predicted check-in locations:", y_predict)

    print("Prediction accuracy:", knn.score(x_test, y_test))
    return None


if __name__ == '__main__':
    knncls()
```
5. Summary:
Questions:
What value should k take, and what is its impact?
k too small: the prediction is easily affected by outliers.
k too large: the prediction is easily swamped by whichever category is most numerous among the k neighbors.
Performance issues?
High time complexity: every prediction computes the distance to all training samples.
Advantages: easy to understand; no parameters to estimate (only the algorithm's hyperparameters); no training or iteration.
Disadvantages: large computation and memory overhead; k must be specified, and a badly chosen k gives no accuracy guarantee.
Usage scenario: small datasets, thousands to tens of thousands of samples.
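Since a badly chosen k hurts accuracy, one common remedy (not covered in the notes above) is to tune k with cross-validation; a sketch using sklearn's GridSearchCV on the bundled iris dataset, with an arbitrary list of candidate k values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try several candidate k values with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
gc = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
gc.fit(X, y)

print(gc.best_params_)  # the k with the best cross-validated accuracy
print(gc.best_score_)
```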