AiLearning Kapitel 2 k-nearest neighbor algorithm (dating website)

Posted by vmavrou on Tue, 21 Dec 2021 18:29:11 +0100

preface

These notes are posted on CSDN purely as a study record, with no commercial use. I hope to exchange ideas with everyone and make progress together.
The learning materials come from GitHub: link

Modules and functions

Modules imported by the GitHub source program

from numpy import *
import operator

Text file processing

Helen's dating data is in the text file datingTestSet2.txt.
Because several data-processing commands are applied to it later, each stage of the data is shown below, starting with the raw data:

fr = open('datingTestSet2.txt')
for line in fr.readlines():
    print(line)
-->
40920	8.326976	0.953952	3

14488	7.153469	1.673904	2

Only two lines of data are shown here; each line is read as a separate string.
A note on zeros: to allocate the blank result array with zeros, the number of lines is first counted with readlines(). That first pass exhausts the file handle (fr), so fr has to be generated a second time, otherwise the processing loop produces no output.
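A quick illustration of why the handle must be reopened (a small sketch, assuming the file sits in the working directory):

fr = open('datingTestSet2.txt')
numberOfLines = len(fr.readlines())   # first pass: count the lines (used to size the zeros array)
print(fr.readlines())                 # --> []  the handle is exhausted, so a loop here prints nothing
fr = open('datingTestSet2.txt')       # reopen so the processing loop can read the lines again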
Data obtained after the .strip() command

fr = open('datingTestSet2.txt')
for line in fr.readlines():
    line = line.strip()
    print(line)
-->
40920	8.326976	0.953952	3
14488	7.153469	1.673904	2

As you can see, the blank lines between the printed lines disappear: strip() is called with no chars argument, so it removes the surrounding whitespace of each line, including the trailing newline, and print no longer adds a second line break.
Data obtained after the .split() command

fr = open('datingTestSet2.txt')
for line in fr.readlines():
    line = line.strip()
    line = line.split('\t')
    print(line)
-->
['40920', '8.326976', '0.953952', '3']
['14488', '7.153469', '1.673904', '2']

After the strip command, the fields within a single line are still separated by tabs, so each string is split on the tab character ('\t') and the data type of each line changes from string to list.
Results of data processing steps

print(type(returnMat))
print(returnMat)
-->
<class 'numpy.ndarray'>
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 ...

The variable returnMat is an ndarray.

print(type(classLabelVector))
print(classLabelVector)
-->
<class 'list'>
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 1, ...]

The variable classLabelVector is a list of integers.
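Putting the file-processing steps above together, here is a minimal sketch of the routine that produces returnMat and classLabelVector (in the source repo this is the file2matrix function; the original may differ in small details):

from numpy import zeros

def file2matrix(filename):
    """Convert the dating text file into a feature matrix and a label list."""
    fr = open(filename)
    numberOfLines = len(fr.readlines())        # first pass: count the lines
    returnMat = zeros((numberOfLines, 3))      # blank array: one row per sample, three features
    classLabelVector = []
    fr = open(filename)                        # reopen, the handle above is exhausted
    index = 0
    for line in fr.readlines():
        listFromLine = line.strip().split('\t')                        # string -> list of four fields
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]    # first three fields are the features
        classLabelVector.append(int(listFromLine[-1]))                 # last field is the class label
        index += 1
    return returnMat, classLabelVector

returnMat, classLabelVector = file2matrix('datingTestSet2.txt')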

Normalization processing

The parameter dataSet of the function autoNorm is the returnMat obtained from the data processing above.
The .max() and .min() methods give the maximum and minimum values of each indicator; passing 0 takes the extreme value of each column (one value per indicator), passing 1 takes the extreme value of each row.

print(returnMat.min(0))
print(type(returnMat.min(0)))
-->
[0.       0.       0.001156]
<class 'numpy.ndarray'>

Because a date is evaluated by three indicators, the array is two-dimensional with three columns, so max and min with axis 0 return not a single number but a one-dimensional array with one value per indicator.
The shape attribute gives the dimensions of dataSet and is used to create the blank array normDataSet; shape[0] is the size of the first dimension, i.e. the number of rows of dataSet, and so on.
Unlike zeros, tile builds a new array by repeating an existing one, which makes array construction more flexible.
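For example, tile can repeat the minimum-value array printed above into as many rows as needed (a toy illustration, not taken from the source):

from numpy import array, tile

minVals = array([0., 0., 0.001156])
print(tile(minVals, (3, 1)))    # repeat minVals 3 times along the rows, once along the columns
-->
[[0.       0.       0.001156]
 [0.       0.       0.001156]
 [0.       0.       0.001156]]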
The idea of the normalization step is to use the per-indicator minimums and ranges, tiled into arrays of the same shape as dataSet, so that the subtraction and division are carried out over the whole array at once instead of one value at a time.
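A sketch of autoNorm written from the description above (so it may differ slightly from the repo's version):

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    """Scale every feature to the 0-1 range: (value - min) / (max - min)."""
    minVals = dataSet.min(0)                          # per-column minimum, a 1-D array of length 3
    maxVals = dataSet.max(0)                          # per-column maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))               # blank array with the same shape as dataSet
    m = dataSet.shape[0]                              # number of rows (samples)
    normDataSet = dataSet - tile(minVals, (m, 1))     # tile minVals into m rows and subtract
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise division by the ranges
    return normDataSet, ranges, minVals

normMat, ranges, minVals = autoNorm(returnMat)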

Algorithm pseudo code

First, the meaning of each parameter of the function classify0, to make the structure of the function easier to follow:
inX: the sample to be classified, i.e. the feature data of one dating candidate
dataSet: the training samples, one row per individual
labels: the labels of the training samples, in the same order as dataSet
k: the number of nearest neighbors taken into account
** applied to an array is an element-wise power, not a matrix power
.sum() chooses the dimension to add over through its axis argument
.argsort() returns the indices that would sort the array in ascending order
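A toy demonstration of these three operations (the numbers are made up; only the shapes matter):

from numpy import array

diffMat = array([[1.0, 2.0],
                 [3.0, 0.0],
                 [0.5, 0.5]])
sqDiffMat = diffMat ** 2             # element-wise square, not a matrix power
sqDistances = sqDiffMat.sum(axis=1)  # add along each row: one squared distance per sample
print(sqDistances)                   # --> [5.  9.  0.5]
print(sqDistances.argsort())         # --> [2 0 1]  indices from nearest to farthest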

voteIlabel = labels[sortedDistIndicies[i]]

Extracts the label of the i-th nearest training sample

classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1 

classCount is defined as a dictionary. This line counts the occurrences of labels 1, 2 and 3: the .get() method fetches the label's current count, with a default of 0 so the first occurrence starts from an initial value, and one is added.

print(classCount)
-->
{1: 5, 3: 23, 2: 2}

Setting the parameters and running the classify0 function gives the classCount above: the counts of labels 1, 2 and 3 among the k nearest samples.
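The same counting pattern on a toy label list, just to show .get() with its default of 0 (the labels here are made up):

labels = [3, 1, 3, 3, 2]
classCount = {}
for voteIlabel in labels:
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1   # previous count (or 0) plus one
print(classCount)
-->
{3: 3, 1: 1, 2: 1}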

sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

The .iteritems() method exists in Python 2.x; in Python 3.x it was replaced by .items().
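So in Python 3.x the call can be written as the equivalent:

sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)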

print(classCount.items())
-->
dict_items([(1, 5), (3, 23), (2, 2)])

classCount processed by .items() returns a view of the dictionary in which each key and its corresponding value form a tuple.
.itemgetter() from the operator module extracts the element at a given position; with 1 it takes the second value of each tuple, so in the example above 5, 23 and 2 are used as the sort keys. With reverse=True the result is sorted in descending order. After processing, the result is:

print(sortedClassCount)
-->
[(3, 23), (1, 5), (2, 2)]

Finally, the function returns sortedClassCount[0][0], the label that occurs most often among the k nearest samples; here it is 3.
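Putting all of the pieces together, a sketch of classify0 consistent with the walkthrough above (written for Python 3, so it uses .items() instead of .iteritems()):

from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    """Return the majority label among the k training samples nearest to inX."""
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet    # difference between inX and every training sample
    sqDiffMat = diffMat ** 2                           # element-wise square
    sqDistances = sqDiffMat.sum(axis=1)                # squared Euclidean distance per row
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()           # sample indices, nearest first
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]     # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                      # the label with the most votes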

Test code

Here the data is split into two groups by numTestVecs: the rows before numTestVecs are the test samples, and the rows from numTestVecs to m (normMat[numTestVecs:m]) serve as the training set.
If you understood the previous step, this part should be easy.
The print statements in the learning materials are missing their parentheses (Python 2 syntax); just add them yourself.
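For reference, a sketch of the test routine described here (the repo calls it datingClassTest; it is adapted to Python 3 print calls, and the 10% hold-out ratio is the book's default, used here as an assumption):

def datingClassTest():
    """Hold out the first part of the data for testing and report the error rate."""
    hoRatio = 0.10                          # fraction of the data used as the test set (assumed default)
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)          # rows before numTestVecs are the test samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # train on rows numTestVecs:m, classify test row i
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))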

Dating site prediction function

In a Python 3.x environment, raw_input should be changed to input.
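A Python 3 sketch of this prediction function with raw_input replaced by input (it follows the book's classifyPerson; the prompt strings and result wording are taken from the book and may differ from the repo):

from numpy import array

def classifyPerson():
    """Ask for the three features of a new person and predict how much Helen will like them."""
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games? "))
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the new sample with the same minimums and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])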

summary

The principle of the k-nearest neighbor algorithm is easy to understand. Personally, I think the difficulty of this project lies in the step-by-step processing of the data, which is why I show the result of every processing step. Some of the results come from parameters I set myself, purely to show the data structures and help myself and everyone understand this code better.
Many thanks to the authors of the learning materials cited in this article; because there are many, they are not listed one by one. I hope we can learn and make progress together, and I welcome you to point out any shortcomings.

Topics: Algorithm Machine Learning