k-nearest neighbor algorithm (kNN)
Notes on machine learning practice
Working principle
We start with a sample data set, also called the training set, in which every sample carries a label, so the class of each sample is known. When a new, unlabeled sample arrives, each of its features is compared with the corresponding feature of every sample in the training set, and the algorithm takes the class labels of the most similar samples (the nearest neighbors). Only the k most similar samples are considered, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k samples is assigned to the new sample.
Nearest neighbor: the training sample with the smallest Euclidean distance to the new sample.
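For reference, the Euclidean distance used to rank the neighbors can be written as follows. The symbols are chosen here for illustration: x is the new sample, y is a training sample, and n is the number of features.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

For example, the distance between (0, 0) and (1.0, 1.2) is \sqrt{1.0^2 + 1.2^2} = \sqrt{2.44} \approx 1.562.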
General process
- Collect data: any method can be used.
- Prepare data: the distance calculation needs numeric values, preferably in a structured data format.
- Analyze data: any method can be used.
- Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
- Test the algorithm: calculate the error rate.
- Use the algorithm: first input the sample data and obtain structured output, then run the k-nearest neighbor algorithm to determine which class each input belongs to, and finally apply any further processing to the computed classes.
Preparing: importing data and labels
import numpy as np

def createDataSet():
    #shape = (number of samples, number of features)
    groups = np.array([[1.0, 1.2], [1.0, 1.0], [0, 0], [0, 0.1]])
    #One label per sample
    labels = ['A', 'A', 'B', 'B']
    return groups, labels
k-nearest neighbor
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]    #Number of samples
    #Repeat inX dataSetSize times and subtract dataSet element-wise to get the differences
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2    #Square each element
    sqDistances = sqDiffMat.sum(axis=1)    #Sum along each row
    distances = sqDistances ** 0.5    #The square root gives the Euclidean distance
    sortedDistIndicies = distances.argsort()    #Indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]    #Label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1    #Increment the vote count for that label
    #Sort by vote count, descending. sortedClassCount is [(label, count), (label, count), ...]
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
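To see the voting step in action, here is a minimal self-contained sketch that runs on the four toy samples from the preparation step. The name `knn_classify` and the use of broadcasting and `collections.Counter` are choices made for this illustration, not part of the code above.

```python
from collections import Counter

import numpy as np


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Broadcasting subtracts query from every row, so np.tile is not needed.
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]  # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


samples = np.array([[1.0, 1.2], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

result_near_origin = knn_classify(np.array([0.0, 0.0]), samples, labels, k=3)
result_near_ones = knn_classify(np.array([1.0, 1.1]), samples, labels, k=3)
print(result_near_origin, result_near_ones)  # B A
```

With k = 3, the point at the origin has two 'B' neighbors and one 'A' neighbor, so it is classified as 'B'. The point near (1, 1) gets two 'A' votes and one 'B' vote.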
Case 1: dating website
Parsing data from a text file
Each line of the text file has the format: feature 1\tfeature 2\t...\tlabel\n
def file2matrix(filename):
    with open(filename) as fr:    #The file is closed automatically when the block ends
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)    #The number of lines is the number of samples
    returnMat = np.zeros((numberOfLines, 3))    #Initialize the returned feature matrix
    classLabelVector = []    #Initialize the label list
    index = 0    #Current row index
    for line in arrayOLines:
        line = line.strip()    #Remove the trailing \n
        listFromLine = line.split('\t')    #Split the fields on \t
        returnMat[index, :] = listFromLine[0:3]    #The first three fields are the three features of this sample
        classLabelVector.append(int(listFromLine[-1]))    #The last field is the integer label
        index += 1    #Move on to the next row
    return returnMat, classLabelVector
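As an alternative sketch, the same tab-separated layout can be read with `np.loadtxt`. The three sample lines below are made up for illustration and are read from an in-memory string rather than a real file.

```python
import io

import numpy as np

# Hypothetical sample in the format feature1\tfeature2\tfeature3\tlabel
sample_text = "1000\t5.0\t0.5\t3\n2000\t2.5\t1.5\t2\n3000\t0.0\t1.0\t1\n"

# np.loadtxt splits each line on the delimiter and converts every field to float.
table = np.loadtxt(io.StringIO(sample_text), delimiter="\t")
features = table[:, :3]                            # first three columns
parsed_labels = table[:, -1].astype(int).tolist()  # last column as Python ints

print(features.shape, parsed_labels)  # (3, 3) [3, 2, 1]
```

This only works when every field, including the label, is numeric, which matches the assumption `file2matrix` makes when it calls `int()` on the last field.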
Normalized data
Normalization puts every feature on the same scale so that each feature has equal influence on the distance, and therefore on the prediction.
Formula: normalized value = (original value - column minimum) / (column maximum - column minimum)
def autoNorm(dataSet):
    minVals = dataSet.min(0)    #Axis 0: minimum of each column
    maxVals = dataSet.max(0)    #Axis 0: maximum of each column
    ranges = maxVals - minVals    #Value range of each column
    m = dataSet.shape[0]    #Number of samples
    normDataSet = dataSet - np.tile(minVals, (m, 1))    #Subtract the column minimum from every row
    normDataSet = normDataSet / np.tile(ranges, (m, 1))    #Divide by the column range: values now lie in 0-1
    return normDataSet, ranges, minVals
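The formula can be checked on a small made-up matrix. The sketch below uses NumPy broadcasting instead of `np.tile`, and it adds a guard for constant columns that is an assumption of this example, not something the code above does. The name `min_max_normalize` is hypothetical.

```python
import numpy as np


def min_max_normalize(data):
    """Scale each column of data to the range 0-1."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Assumption: a constant column (range 0) is mapped to 0 instead of dividing by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (data - min_vals) / safe_ranges, safe_ranges, min_vals


data = np.array([[10.0, 200.0, 5.0],
                 [20.0, 400.0, 5.0],
                 [30.0, 300.0, 5.0]])
norm_data, ranges, min_vals = min_max_normalize(data)
print(norm_data)

# A new sample is scaled with the minimum and range of the training data.
new_sample = (np.array([15.0, 250.0, 5.0]) - min_vals) / ranges
print(new_sample)  # [0.25 0.25 0.  ]
```

Without scaling, the second column would dominate the Euclidean distance simply because its values are larger.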
Test algorithm
Part of the sample data set is held out as a test set; the predictions for it are compared with its true labels to calculate the error rate.
def datingClassTest():
    hoRatio = 0.10    #Hold out the first 10% of the samples as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]    #Number of samples in the sample set
    numTestVecs = int(m * hoRatio)    #Number of test samples
    errorCount = 0.0    #Number of misclassified test samples
    for i in range(numTestVecs):
        #classify0(inX, dataSet, labels, k): the rows from numTestVecs onward serve as the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is %f" % (errorCount / float(numTestVecs)))
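The test above holds out the first rows of the file, which is only a fair sample if the file is not ordered in some way. A minimal sketch of a random hold-out split follows; the function name, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def random_holdout_split(data, labels, ho_ratio=0.10, seed=0):
    """Shuffle the samples, then hold out a ho_ratio share as the test set."""
    rng = np.random.default_rng(seed)       # seed fixed only for reproducibility
    order = rng.permutation(data.shape[0])  # random ordering of the row indices
    num_test = int(data.shape[0] * ho_ratio)
    test_idx, train_idx = order[:num_test], order[num_test:]
    labels = np.asarray(labels)
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]


data = np.arange(40, dtype=float).reshape(20, 2)  # 20 made-up samples
labels = [i % 3 + 1 for i in range(20)]
train_x, train_y, test_x, test_y = random_holdout_split(data, labels)
print(train_x.shape, test_x.shape)  # (18, 2) (2, 2)
```

Each test sample would then be classified against `train_x` and `train_y` only, exactly as the loop above classifies the first rows against the remaining ones.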
Use algorithm
The prediction is computed from values entered by the user.
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']    #Text for the numeric labels 1, 2, 3
    percentTats = float(input("percentage of time spent playing video games?"))    #Input feature 1
    ffMiles = float(input("frequent flier miles earned per year?"))    #Input feature 2
    iceCream = float(input("liters of ice cream consumed per year?"))    #Input feature 3
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])    #Input vector
    #Normalize the input with the training minimum and range before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will probably like this person: ", resultList[classifierResult - 1])
Case 2: handwriting recognition system
Raw image data: each 32 * 32 black-and-white image is stored as a text matrix of the characters 0 and 1
Prepare data
Convert each 32 * 32 image into a 1 * 1024 vector for subsequent processing
def img2vector(filename):
    returnVect = np.zeros((1, 1024))    #Initialize the 1 * 1024 result
    with open(filename) as fr:    #The file is closed automatically when the block ends
        for i in range(32):
            lineStr = fr.readline()    #Row i of the image
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])    #Pixel (i, j) goes to position 32*i+j
    return returnVect
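The flattening can be tried without any files. The sketch below works on a list of text rows held in memory; the function name `text_image_to_vector` and the stroke image are made up for this example.

```python
import numpy as np


def text_image_to_vector(lines, size=32):
    """Flatten size rows of '0'/'1' characters into a 1 x (size*size) vector."""
    rows = [[int(ch) for ch in line[:size]] for line in lines[:size]]
    return np.array(rows, dtype=float).reshape(1, size * size)


# Made-up image: all zeros except a vertical stroke in column 16.
image_lines = ["0" * 16 + "1" + "0" * 15 for _ in range(32)]
vector = text_image_to_vector(image_lines)
print(vector.shape, int(vector.sum()))  # (1, 1024) 32
```

Row i, column j of the image ends up at position 32*i+j of the vector, the same layout that `img2vector` produces.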
Test algorithm
Handwritten digit recognition using the k-nearest neighbor algorithm
from os import listdir

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')    #List the files in the training folder
    m = len(trainingFileList)    #Number of training files
    trainingMat = np.zeros((m, 1024))    #Initialize the training set
    for i in range(m):
        fileNameStr = trainingFileList[i]    #Name of the i-th file
        fileStr = fileNameStr.split('.')[0]    #Drop the .txt suffix
        classNumStr = int(fileStr.split('_')[0])    #The digit before the underscore in the file name is the label
        hwLabels.append(classNumStr)    #Save the label in hwLabels
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)    #Read the file and convert it to 1 * 1024
    testFileList = listdir('testDigits')    #List the files in the test folder
    errorCount = 0.0
    mTest = len(testFileList)    #Number of test files
    for i in range(mTest):
        fileNameStr = testFileList[i]    #Name of the i-th test file
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])    #True label of the test file
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
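The label extraction relies on the digit appearing before an underscore in the file name. A small sketch of that step in isolation follows; the helper name and the example file names are hypothetical.

```python
import os


def label_from_filename(file_name):
    """Return the digit encoded before the underscore in file_name."""
    stem = os.path.splitext(file_name)[0]  # drop the .txt extension
    return int(stem.split('_')[0])


# Hypothetical file names following a <digit>_<index>.txt pattern
example_names = ['0_13.txt', '9_45.txt', '3_107.txt']
parsed = [label_from_filename(name) for name in example_names]
print(parsed)  # [0, 9, 3]
```

A file name that does not follow this pattern makes `int()` raise a `ValueError`, so stray files in the folders need to be filtered out first.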