Classification method based on probability theory: Naive Bayes
- Note: all code in this chapter uses Python 3
- Advantages: works with a small amount of data and can handle multi-class problems
- Disadvantages: sensitive to how the input data is prepared
- Applicable data type: nominal values
Probability theory underlies many machine learning algorithms. This chapter starts with the simplest probabilistic classifier and then introduces the assumptions behind the naive Bayes classifier. It is called "naive" because the whole formal derivation makes only the most primitive and simplest assumptions. We make full use of Python's text-processing abilities to split documents into word vectors, and then use those word vectors to classify documents. We will also build a classifier and observe how well it filters a real spam data set. Finally, we look at how to learn a classifier from a large number of personal ads and turn the results into human-readable information.
Classification method based on Bayesian decision theory
Naive Bayes is part of Bayesian decision theory. The core idea of Bayesian decision theory is to choose the decision with the highest probability (by comparing probabilities).
Conditional probability
Let A and B be two events with $P(A) > 0$. Then

$$P(B \mid A) = \dfrac{P(AB)}{P(A)}$$

is the conditional probability of event B given that event A has occurred.
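A quick worked example with made-up numbers: if $P(A) = 0.5$ and $P(AB) = 0.2$, then

$$P(B \mid A) = \frac{P(AB)}{P(A)} = \frac{0.2}{0.5} = 0.4$$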
The effective method for computing a conditional probability is known as Bayes' rule. Bayes' rule tells us how to swap the condition and the result in a conditional probability: if $P(x|c)$ is known and $P(c|x)$ is required, it can be computed as:

$$P(c \mid x) = \dfrac{P(x \mid c)\,P(c)}{P(x)}$$
Use conditional probability to classify
What we really need to answer is: given a data point described by x and y, what is the probability that it came from class $c_1$ (i.e. $p(c_1|x,y)$), and what is the probability that it came from class $c_2$ (i.e. $p(c_2|x,y)$)? Bayes' rule lets us swap the condition and the result in these probabilities; applied here it reads:

$$p(c_i \mid x, y) = \dfrac{p(x, y \mid c_i)\,p(c_i)}{p(x, y)}$$
Using these definitions, the Bayesian classification rule can be stated as:
- if $P(c_1|x,y) > P(c_2|x,y)$, the point belongs to class $c_1$
- if $P(c_1|x,y) < P(c_2|x,y)$, the point belongs to class $c_2$
Using Bayes' rule, an unknown probability can be computed from three known probabilities.
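A minimal sketch of this computation in Python; the three probability values below are made up purely for illustration:

```python
# Bayes' rule with made-up numbers: the three known quantities are
# P(x|c), P(c) and P(x); the unknown posterior P(c|x) follows directly.
p_x_given_c = 0.6   # likelihood P(x|c), assumed known
p_c = 0.3           # prior P(c), assumed known
p_x = 0.4           # evidence P(x), assumed known

p_c_given_x = p_x_given_c * p_c / p_x   # posterior P(c|x) by Bayes' rule
print(p_c_given_x)                      # 0.45
```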
Document classification using naive Bayes
An important application of machine learning is automatic document classification.
General process of naive Bayes
- Collect data: any method can be used; this chapter uses RSS feeds
- Prepare data: numeric or Boolean values are required
- Analyze data: with many features, plotting individual features is of little help; histograms work better
- Train the algorithm: compute the conditional probabilities of the different independent features
- Test the algorithm: compute the error rate
- Use the algorithm: one common naive Bayes application is document classification; a naive Bayes classifier can be used in any classification setting, not necessarily text
Independence means that the probability of one feature or word has nothing to do with its proximity to other words. One assumption of naive Bayes is that the features are independent of each other; another assumption is that every feature is equally important.
Document classification using Python
To get features from text, the text first has to be split. The overall process: the features come from tokens (any combination of characters, e.g. a URL, an IP address, or any other string); each piece of text is then represented as a word vector; a classifier is then built from these vectors; finally, some practical issues that arise when implementing naive Bayes are addressed.
1. Preparing data: building word vectors from text
Treat the text as a word (or token) vector, that is, convert each sentence into a vector.
2. Training algorithm: calculate the probability from the word vector
Rewrite Bayes' rule, replacing the previous x and y with $\textbf{w}$. The bold $\textbf{w}$ denotes a vector, i.e. it consists of multiple values; in this case the number of values equals the number of words in the vocabulary.
$$p(c_i \mid \textbf{w}) = \dfrac{p(\textbf{w} \mid c_i)\,p(c_i)}{p(\textbf{w})}$$
Compute this value for each class with the above formula, then compare the two probabilities.
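This is where the independence assumption introduced earlier pays off: it lets $p(\textbf{w} \mid c_i)$ factor into a product of per-word probabilities, which is exactly what the training function in the full listing estimates:

$$p(\textbf{w} \mid c_i) = p(w_0 \mid c_i)\,p(w_1 \mid c_i)\cdots p(w_N \mid c_i) = \prod_{j=0}^{N} p(w_j \mid c_i)$$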
3. Test algorithm: modify the classifier according to the actual situation
When classifying a document with naive Bayes, we multiply many probabilities together to obtain the probability that the document belongs to a given class, i.e. we compute $p(w_0|1)p(w_1|1)p(w_2|1)\cdots$. If any one of these probabilities is 0, the whole product is 0. To lessen this effect, initialize the occurrence count of every word to 1 and the denominators to 2.
Another problem is underflow, caused by multiplying many very small numbers. One solution is to take the natural logarithm of the product.
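This works because of the product rule for logarithms, and because $\ln$ is monotonically increasing, so comparing log-probabilities yields the same decision as comparing the probabilities themselves:

$$\ln(a \cdot b) = \ln a + \ln b, \qquad a > b \iff \ln a > \ln b \quad (a, b > 0)$$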
4. Preparing data: document bag model
So far the occurrence of each word has been used as a feature; this is called the set-of-words model. If a word appears more than once in a document, that may carry information which mere presence or absence cannot express; the approach that keeps counts is called the bag-of-words model. In a bag of words each word can appear many times, while in a set of words each word can appear only once.
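A minimal sketch of the difference, using the setOfWords2Vec and bagOfWords2VecMN functions defined in the full listing below (the toy vocabulary and document are made up for illustration):

```python
vocab = ['stupid', 'dog', 'my', 'garbage']  # toy vocabulary
doc = ['stupid', 'garbage', 'stupid']       # 'stupid' occurs twice

print(setOfWords2Vec(vocab, doc))    # set-of-words model:  [1, 0, 0, 1] - presence/absence only
print(bagOfWords2VecMN(vocab, doc))  # bag-of-words model:  [2, 0, 0, 1] - occurrence counts
```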
5. Complete sample code
```python
import numpy as np


# Define the sample data set
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 stands for insulting speech, 0 stands for normal speech
    return postingList, classVec


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])  # Create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Union of the two sets removes duplicates; the order may vary between runs
    return list(vocabSet)


# Set-of-words model: output a document vector
def setOfWords2Vec(vocabList, inputSet):  # The parameters are the vocabulary list and a document
    returnVec = [0] * len(vocabList)  # Create a vector of zeros
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # If a vocabulary word appears in the document, set its position to 1
        else:
            print("The word: %s is not in my Vocabulary!" % word)
    return returnVec  # Positions of words that appear are 1, the rest are 0


# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # Increment the count instead of setting it to 1
    return returnVec
    # bagOfWords2VecMN is almost identical to setOfWords2Vec; the difference is that every time a
    # word is encountered its count is incremented rather than set to 1


# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):  # trainMatrix is the document matrix, trainCategory the vector of labels
    numTrainDocs = len(trainMatrix)   # Number of documents
    numWords = len(trainMatrix[0])    # Number of words in the vocabulary
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)  # Probability that a document is abusive, i.e. p(1)
    # Initialize counts to 1 and denominators to 2 (instead of zeros) to avoid zero probabilities
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]            # Add this document's word vector to the class-1 counts
            p1Denom += np.sum(trainMatrix[i])  # Add the total number of words in this document
        else:
            p0Num += trainMatrix[i]            # Add this document's word vector to the class-0 counts
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # Log of the per-word probabilities p(w|1); the log avoids underflow
    p0Vect = np.log(p0Num / p0Denom)  # Log of the per-word probabilities p(w|0)
    return p0Vect, p1Vect, pAbusive


# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)        # Element-wise product with the class-1 log probabilities, summed
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # Same for class 0
    if p1 > p0:
        return 1
    else:
        return 0  # Return whichever class has the higher (log) probability


def testingNB():
    listOPosts, listClasses = loadDataSet()    # Load the data
    myVocabList = createVocabList(listOPosts)  # Create the vocabulary list (no duplicates)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  # Convert each post to a 0/1 document vector
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))  # Train the classifier
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))


testingNB()
```
Example: filtering spam using naive Bayes
Example: e-mail classification using naive Bayes
- Collect data: provide text files
- Prepare data: parse the text file into an entry vector
- Analyze data: check entries to ensure the correctness of parsing
- Training algorithm: use the trainNB0 function established before
- Test algorithm: use classifyNB and build a new test function to calculate the error rate of the document
- Using the algorithm: build a complete program that classifies a group of documents and prints the misclassified documents to the screen
1. Preparing data: slicing text
Use Python's string split() method; the full listing below uses re.split() so that punctuation is stripped as well.
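A minimal sketch of that tokenization (the sample sentence is made up for illustration):

```python
import re

text = 'This book is the BEST book on Python, or M.L., I have ever laid eyes upon.'
tokens = re.split(r'\W+', text)                          # split on runs of non-alphanumeric characters
words = [tok.lower() for tok in tokens if len(tok) > 2]  # drop tokens of one or two characters, lower-case the rest
print(words)
```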
2. Test algorithm: cross validation using naive Bayes
The process of randomly selecting part of the data as the training set and using the remainder as the test set is called hold-out cross-validation.
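A minimal sketch of the hold-out split that spamTest performs below: 50 document indices, 10 of which are moved at random into the test set.

```python
import numpy as np

trainingSet = list(range(50))  # indices of all 50 documents
testSet = []
for i in range(10):
    randIndex = int(np.random.uniform(0, len(trainingSet)))  # random position among the remaining indices
    testSet.append(trainingSet[randIndex])                   # move that index to the test set
    del(trainingSet[randIndex])                              # remove it so it cannot be chosen again
print(len(trainingSet), len(testSet))  # 40 10
```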
3. Complete code example
```python
import re

import numpy as np


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Union of the two sets removes duplicates
    return list(vocabSet)


# Set-of-words model: output a document vector
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # 1 if the vocabulary word appears in the document
        else:
            print("The word: %s is not in my Vocabulary!" % word)
    return returnVec


# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)        # log probability of class 1
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # log probability of class 0
    if p1 > p0:
        return 1
    else:
        return 0


# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)  # p(1), the prior probability of class 1
    p0Num = np.ones(numWords)   # Initialize counts to 1 and denominators to 2 to avoid zero probabilities
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # log of the per-word probabilities p(w|1)
    p0Vect = np.log(p0Num / p0Denom)  # log of the per-word probabilities p(w|0)
    return p0Vect, p1Vect, pAbusive


# Slice text into lower-case tokens
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)  # split on non-word characters (the Python 2 code used r'\W*', which behaves differently in Python 3)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # keep tokens longer than two characters, lower-cased


# Hold-out cross-validation on the spam data set
def spamTest():
    docList = []    # list of tokenized documents
    classList = []  # class labels: 1 for spam, 0 for ham
    fullText = []   # all tokens from all e-mails
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())  # read and tokenize one spam e-mail
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())   # read and tokenize one ham e-mail
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # build the vocabulary list
    trainingSet = list(range(50))         # in Python 3, range() must be turned into a list so items can be deleted
    testSet = []
    for i in range(10):
        randIndex = int(np.random.uniform(0, len(trainingSet)))  # random index into the remaining training set
        testSet.append(trainingSet[randIndex])                   # move it to the test set
        del(trainingSet[randIndex])                              # remove it so it cannot be selected again
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # document vector for each training document
        trainClasses.append(classList[docIndex])                       # matching class label
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the classifier
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1  # count misclassified test documents
    print("The error rate is:", float(errorCount) / len(testSet))


spamTest()
```
Example: using naive Bayesian classifier to obtain regional tendency from personal advertisement
Example: use naive Bayes to find words related to region
- Collect data: to collect content from RSS feeds, you need to build an interface for RSS feeds
- Prepare data: parse the text file into an entry vector
- Analyze data: check entries to ensure the correctness of parsing
- Training algorithm: use the trainNB0 function established before
- Test algorithm: observe the error rate to make sure the classifier works; the tokenizer can be modified to reduce the error rate and improve the results
- Using the algorithm: build a complete program that wraps everything together; given two RSS feeds, it displays the most commonly used words for each
1. Collecting data: importing RSS feeds
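A minimal sketch of importing a feed with feedparser; the URL is the one used in the full listing below and may no longer return entries:

```python
import feedparser

# Parse an RSS feed into a dictionary-like object
ny = feedparser.parse('https://newyork.craigslist.org/d/rooms-shares/search/roo?format=rss')
print(len(ny['entries']))               # number of entries the feed returned (0 if the feed is dead)
if ny['entries']:
    print(ny['entries'][0]['summary'])  # the entry text that localWords tokenizes below
```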
2. Analyze the data: display the words related to the region
3. Complete sample code display
```python
import operator
import re

import feedparser
import numpy as np


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # increment the count instead of setting it to 1
    return returnVec


# Slice text into lower-case tokens
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


# Naive Bayes classifier training function (same as in the spam example)
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


# Naive Bayes classification function (same as in the spam example)
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


# Return the 30 most frequent words
def calcMostFreq(vocabList, fullText):  # vocabList: words without duplicates; fullText: all words including duplicates
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)  # count how often each vocabulary word occurs
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)  # sort by count, descending
    return sortedFreq[:30]  # the 30 most frequent words


# RSS feed classifier with high-frequency word removal
def localWords(feed1, feed0):
    docList = []    # list of tokenized entries
    classList = []  # class labels: 1 for entries from feed1, 0 for entries from feed0
    fullText = []   # all tokens from all entries
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])  # tokenize one entry from feed1
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])  # tokenize one entry from feed0
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)  # the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])  # remove them, much like a stop-word list
    trainingSet = list(range(2 * minLen))
    testSet = []
    for i in range(20):  # randomly move 20 documents into the test set, as in spamTest
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print("The error rate is:", float(errorCount) / len(testSet))
    return vocabList, p0V, p1V
    # localWords is essentially the same as spamTest; the difference is that spamTest reads
    # local files while localWords reads RSS feeds


# Display the most characteristic words of each region
def getTopWords(ny, sf):
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []
    topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))  # words whose log probability for class 0 exceeds -6.0
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))  # words whose log probability for class 1 exceeds -6.0
    # Sort by log probability, descending; see
    # https://blog.csdn.net/weixin_52626164/article/details/116676414 for sorted(..., key=lambda ...) usage
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("*SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF*")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("*NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY*")
    for item in sortedNY:
        print(item[0])


ny = feedparser.parse('https://newyork.craigslist.org/d/rooms-shares/search/roo?format=rss')
sf = feedparser.parse('https://sfbay.craigslist.org/d/rooms-shares/search/roo?format=rss')
# Note: these feeds no longer return data, so the two lines above may not run as written.
# The important thing is the idea of the algorithm; any two RSS feeds (or other text sources) can be substituted.
vocabList, pSF, pNY = localWords(ny, sf)
getTopWords(ny, sf)
```
Summary
For classification, using probabilities is sometimes more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate an unknown probability from known values.
The amount of data needed can be reduced by assuming conditional independence between features. The independence assumption means that the probability of one word does not depend on the other words in the document. Although the independence assumption is not strictly correct, naive Bayes is still an effective classifier. Practical considerations: underflow can be handled by taking the logarithm of the probabilities, and the bag-of-words model works better than the set-of-words model for document classification.