Simple practice with the k-nearest neighbor algorithm

Posted by toyfruit on Sun, 26 Dec 2021 07:48:33 +0100

k-nearest neighbor algorithm (kNN)

Notes on machine learning practice

Working principle

There is a sample data set, also known as the training sample set, in which every data point carries a label; that is, we know which class each sample belongs to. When new data without a label is input, each feature of the new data is compared with the corresponding feature of every sample in the set, and the algorithm extracts the class labels of the most similar (nearest) samples. In general, only the k most similar samples are considered, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that appears most frequently among those k samples is taken as the class of the new data.

Nearest neighbor: the sample point with the smallest Euclidean distance to the input point.
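
As a quick illustration (a minimal sketch, not from the original code), the distance from one query point to every sample can be computed with vectorized NumPy operations; classify0 below does the same thing with tile, squaring, a row sum, and a square root:

import numpy as np

samples = np.array([[1.0, 1.2], [1.0, 1.0], [0, 0], [0, 0.1]])	#Toy samples, same as createDataSet below
query = np.array([0.2, 0.1])	#A hypothetical unlabeled point

distances = np.sqrt(((samples - query) ** 2).sum(axis = 1))	#One Euclidean distance per sample
print(distances)
print(distances.argmin())	#Index of the nearest neighbor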

General process

  1. Collect data: any method can be used.
  2. Prepare data: the numeric values needed for distance calculation, preferably in a structured data format.
  3. Analyze data: any method can be used.
  4. Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
  5. Test the algorithm: calculate the error rate.
  6. Use the algorithm: first input sample data and get structured output, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally perform any follow-up processing on the computed class.

Preparation: creating the data set and labels

import numpy as np

def createDataSet():
	groups = np.array([[1.0, 1.2], [1.0, 1.0], [0, 0], [0, 0.1]])	#shape = (number of samples, number of features)
	labels = ['A', 'A', 'B', 'B']	#length = number of samples, one label per row of groups
	return groups, labels

k-nearest neighbor

import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
	dataSetSize = dataSet.shape[0]	#Number of samples
	diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet	#Repeat inX once per sample, then subtract the data set element-wise to get the differences
	sqDiffMat = diffMat ** 2	#Square each element
	sqDistances = sqDiffMat.sum(axis = 1)	#Sum each row (across features)
	distances = sqDistances ** 0.5	#Take the square root to get the Euclidean distances
	sortedDistIndicies = distances.argsort()	#Indices that sort the distances in ascending order
	classCount = {}
	for i in range(k):
		voteIlabel = labels[sortedDistIndicies[i]]	#Label of the i-th nearest sample
		classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1	#Increment the vote count for this label
	# Sort by vote count, descending; sortedClassCount is a list of (label, count) tuples
	sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
	return sortedClassCount[0][0]	#Return the label with the most votes
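
A quick sanity check with the toy data from createDataSet: with k = 3, the three nearest neighbors of [0, 0] are the two 'B' samples and one 'A' sample, so the majority vote returns 'B'.

groups, labels = createDataSet()
print(classify0([0, 0], groups, labels, 3))	#Prints 'B'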

Case 1: dating website

Parsing data from a text file

Each line of the text file has the format: feature1\tfeature2\t...\tlabel\n (tab-separated feature values followed by the class label).

def file2matrix(filename):
	fr = open(filename)
	arrayOLines = fr.readlines()
	numberOfLines = len(arrayOLines)	#The number of rows is the number of samples
	returnMat = np.zeros((numberOfLines, 3))	#Initialize the returned sample set
	classLabelVector = []	#Initialize label
	index = 0	#Record line label
	for line in arrayOLines:
		line = line.strip()	#Strip the trailing newline
		listFromLine = line.split('\t')	#Split the line on tab characters
		returnMat[index, :] = listFromLine[0 : 3]	#The first three fields (the three features) fill one row of returnMat
		classLabelVector.append(int(listFromLine[-1]))	#Append the label, the last field, to the label list
		index += 1	#Move to the next row
	fr.close()	#Close the file
	return returnMat, classLabelVector
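
For illustration, assuming the data file from the book (datingTestSet2.txt, which stores numeric labels 1-3), a line looks like "40920\t8.326976\t0.953952\t3" and loading it is one call:

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')	#Assumes the book's data file is present
print(datingDataMat[0:3, :])	#First three rows of features
print(datingLabels[0:3])	#Their labels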

Normalized data

The purpose of normalization is to give every feature equal influence on the distance calculation, regardless of its original scale.

Formula: normalized value = (original value - column minimum) / column range, where range = column maximum - column minimum

def autoNorm(dataSet):
	minVals = dataSet.min(0)	#0 takes the minimum value of each column
	maxVals = dataSet.max(0)	#0 takes the maximum value of each column
	ranges = maxVals - minVals	#Take the data range of each column
	normDataSet = np.zeros(dataSet.shape)	#Initialize the normalized result matrix
	m = dataSet.shape[0]	#Number of samples
	normDataSet = dataSet - np.tile(minVals, (m, 1))	#Subtract each column's minimum from every row
	normDataSet = normDataSet / np.tile(ranges, (m, 1))	#Divide by each column's range to map values into [0, 1]
	return normDataSet, ranges, minVals
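
A small check on the toy data from createDataSet; after normalization every value lies in [0, 1], and ranges and minVals are returned so that new inputs can be normalized the same way later:

groups, labels = createDataSet()
normGroups, ranges, minVals = autoNorm(groups)
print(normGroups)	#All values in [0, 1]
print(ranges, minVals)	#Per-column range and minimum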

Test algorithm

A portion of the sample data set is held out as a test set; the predictions on it are compared with its true labels to compute the error rate.

def datingClassTest():
	hoRatio = 0.10	#Hold out the first 10% of the samples as the test set; the rest serves as the training set
	datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')	#Version of the data file with numeric labels, which int() in file2matrix requires
	normMat, ranges, minVals = autoNorm(datingDataMat)
	m = normMat.shape[0]	#Number of samples in the sample set
	numTestVecs = int(m * hoRatio)	#Number of test set data
	errorCount = 0.0	#Used to calculate the error rate
	for i in range(numTestVecs):
		#classify0(inX, dataSet, labels, k)
		classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],\
							datingLabels[numTestVecs:m], 30)
		print("the classifier came back with: %d, the real answer is: %d"\
							%(classifierResult, datingLabels[i]))
		if(classifierResult != datingLabels[i]):
			errorCount += 1.0
		print("the total error rate is %f"%(errorCount / float(numTestVecs)))

Using the algorithm

The user enters the three feature values, and the classifier predicts the result.

def classifyPerson():
	resultList = ['not at all', 'in small doses', 'in large doses']	#Maps the numeric labels 1-3 back to readable text
	percentTats = float(input("percentage of time spent playing video games?"))	#Feature: gaming time
	ffMiles = float(input("frequent flier miles earned per year?"))	#Feature: flier miles
	iceCream = float(input("liters of ice cream consumed per year?"))	#Feature: ice cream
	datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
	normMat, ranges, minVals = autoNorm(datingDataMat)
	inArr = np.array([ffMiles, percentTats, iceCream])	#Features in the same column order as the data file
	classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)	#Normalize the input with the stored minimums and ranges
	print("you will probably like this person: ", resultList[classifierResult - 1])	#Labels are 1-3, list indices are 0-2

Case 2: handwriting recognition system

Raw image data: each digit is a 32 * 32 black-and-white image stored as a text matrix of 0s and 1s.

Prepare data

Flatten each 32 * 32 matrix into a 1 * 1024 vector so that classify0 can treat every pixel as a feature.

def img2vector(filename):
	returnVect = np.zeros((1, 1024))	#initialization
	fr = open(filename)
	for i in range(32):
		lineStr = fr.readline()
		for j in range(32):
			returnVect[0, 32*i+j] = int(lineStr[j])	#Store each character as one pixel value in returnVect
	fr.close()	#Close the file
	return returnVect
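
A quick check, assuming the book's naming convention for the digit files (label_instance.txt, e.g. 0_13.txt in testDigits):

testVector = img2vector('testDigits/0_13.txt')	#Hypothetical file name following the label_instance.txt pattern
print(testVector[0, 0:31])	#The first 31 values of the flattened image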

Test algorithm

Handwritten numeral recognition using the k-nearest neighbor algorithm

from os import listdir

def handwritingClassTest():
	hwLabels = []
	trainingFileList = listdir('trainingDigits')	#List the files in the training digits folder
	m = len(trainingFileList)	#Number of training files
	trainingMat = np.zeros((m, 1024))	#Initialize the training data set
	for i in range(m):
		fileNameStr = trainingFileList[i]	#Read each file name
		fileStr = fileNameStr.split('.')[0]	#Drop the .txt suffix
		classNumStr = int(fileStr.split('_')[0])	#The file name format is label_instance.txt, so the label precedes the underscore
		hwLabels.append(classNumStr)	#Store the label in hwLabels
		trainingMat[i,:] = img2vector('trainingDigits/%s' %fileNameStr)	#Read each file and flatten it to 1 * 1024
	testFileList = listdir('testDigits')	#Open the folder where the test data is stored
	errorCount = 0.0
	mTest = len(testFileList)	#Number of test data
	for i in range(mTest):
		fileNameStr = testFileList[i]	#Read each test file name
		fileStr = fileNameStr.split('.')[0]
		classNumStr = int(fileStr.split('_')[0])
		vectorUnderTest = img2vector('testDigits/%s' %fileNameStr)
		classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
		print("the classifier came back with: %d, the real answer is: %d" %(classifierResult, classNumStr))
		if(classifierResult != classNumStr):
			errorCount += 1.0
	print("\nthe total number of errors is: %d" %errorCount)
	print("\nthe total error rate is: %f" %(errorCount/float(mTest)))

Topics: Python, Machine Learning