Classification method based on probability theory: Naive Bayes
- Note: all code in this chapter uses Python 3
- Advantages: works with a small amount of data and can handle multi-class problems
- Disadvantages: sensitive to how the input data is prepared
- Applicable data type: nominal values
Probability theory underlies many machine learning algorithms. This chapter starts with the simplest probabilistic classifier and then introduces the assumptions behind the naive Bayes classifier. It is called "naive" because the whole formal derivation makes only the most primitive and simplest assumptions. We make full use of Python's text-processing abilities to split documents into word vectors, and then use those word vectors to classify documents. We will also build a classifier and observe how well it filters a real spam data set. Finally, we look at how to learn a classifier from a large number of personal ads and turn the results into human-readable information.
Classification method based on Bayesian decision theory
Naive Bayes is part of Bayesian decision theory. The core idea of Bayesian decision theory is to choose the decision with the highest probability (by comparing probabilities).
Conditional probability
Let A and B be two events with $P(A) > 0$. Then

$$P(B \mid A) = \dfrac{P(AB)}{P(A)}$$

is the conditional probability of event B given that event A has occurred.
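A quick worked example with made-up numbers: if $P(A) = 0.5$ and $P(AB) = 0.2$, then

$$P(B \mid A) = \frac{P(AB)}{P(A)} = \frac{0.2}{0.5} = 0.4$$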
The effective method for computing a conditional probability is known as Bayes' rule. Bayes' rule tells us how to swap the condition and the result in a conditional probability: if $P(x|c)$ is known and $P(c|x)$ is required, it can be computed as:

$$P(c \mid x) = \dfrac{P(x \mid c)\,P(c)}{P(x)}$$
Use conditional probability to classify
What we really need to answer is: given a data point described by x and y, what is the probability that it came from class $c_1$ (i.e. $p(c_1|x,y)$), and what is the probability that it came from class $c_2$ (i.e. $p(c_2|x,y)$)? Bayes' rule lets us swap the condition and the result in these probabilities; applied here it reads:

$$p(c_i \mid x, y) = \dfrac{p(x, y \mid c_i)\,p(c_i)}{p(x, y)}$$
Using these definitions, the Bayesian classification rule can be stated as:
- if $P(c_1|x,y) > P(c_2|x,y)$, the point belongs to class $c_1$
- if $P(c_1|x,y) < P(c_2|x,y)$, the point belongs to class $c_2$
Using Bayes' rule, an unknown probability can be computed from three known probabilities.
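A minimal sketch of this computation in Python; the three probability values below are made up purely for illustration:

```python
# Bayes' rule with made-up numbers: the three known quantities are
# P(x|c), P(c) and P(x); the unknown posterior P(c|x) follows directly.
p_x_given_c = 0.6   # likelihood P(x|c), assumed known
p_c = 0.3           # prior P(c), assumed known
p_x = 0.4           # evidence P(x), assumed known

p_c_given_x = p_x_given_c * p_c / p_x   # posterior P(c|x) by Bayes' rule
print(p_c_given_x)                      # 0.45
```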
Document classification using naive Bayes
An important application of machine learning is automatic document classification.
General process of naive Bayes
- Collect data: any method can be used; this chapter uses RSS feeds
- Prepare data: numeric or Boolean values are required
- Analyze data: with many features, plotting individual features is of little help; histograms work better
- Train the algorithm: compute the conditional probabilities of the different independent features
- Test the algorithm: compute the error rate
- Use the algorithm: one common naive Bayes application is document classification; a naive Bayes classifier can be used in any classification setting, not necessarily text
Independence means that the probability of one feature or word has nothing to do with its proximity to other words. One assumption of naive Bayes is that the features are independent of each other; another assumption is that every feature is equally important.
Document classification using Python
To get features from text, the text first has to be split. The overall process: the features come from tokens (any combination of characters, e.g. a URL, an IP address, or any other string); each piece of text is then represented as a word vector; a classifier is then built from these vectors; finally, some practical issues that arise when implementing naive Bayes are addressed.
1. Preparing data: building word vectors from text
Treat the text as a word (or token) vector, that is, convert each sentence into a vector.
2. Training algorithm: calculate the probability from the word vector
Rewrite Bayes' rule, replacing the previous x and y with $\textbf{w}$. The bold $\textbf{w}$ denotes a vector, i.e. it consists of multiple values; in this case the number of values equals the number of words in the vocabulary.
$$p(c_i \mid \textbf{w}) = \dfrac{p(\textbf{w} \mid c_i)\,p(c_i)}{p(\textbf{w})}$$
Compute this value for each class with the above formula, then compare the two probabilities.
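This is where the independence assumption introduced earlier pays off: it lets $p(\textbf{w} \mid c_i)$ factor into a product of per-word probabilities, which is exactly what the training function in the full listing estimates:

$$p(\textbf{w} \mid c_i) = p(w_0 \mid c_i)\,p(w_1 \mid c_i)\cdots p(w_N \mid c_i) = \prod_{j=0}^{N} p(w_j \mid c_i)$$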
3. Test algorithm: modify the classifier according to the actual situation
When classifying a document with naive Bayes, we multiply many probabilities together to obtain the probability that the document belongs to a given class, i.e. we compute $p(w_0|1)p(w_1|1)p(w_2|1)\cdots$. If any one of these probabilities is 0, the whole product is 0. To lessen this effect, initialize the occurrence count of every word to 1 and the denominators to 2.
Another problem is underflow, caused by multiplying many very small numbers. One solution is to take the natural logarithm of the product.
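This works because of the product rule for logarithms, and because $\ln$ is monotonically increasing, so comparing log-probabilities yields the same decision as comparing the probabilities themselves:

$$\ln(a \cdot b) = \ln a + \ln b, \qquad a > b \iff \ln a > \ln b \quad (a, b > 0)$$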
4. Preparing data: document bag model
So far the occurrence of each word has been used as a feature; this is called the set-of-words model. If a word appears more than once in a document, that may carry information which mere presence or absence cannot express; the approach that keeps counts is called the bag-of-words model. In a bag of words each word can appear many times, while in a set of words each word can appear only once.
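A minimal sketch of the difference, using the setOfWords2Vec and bagOfWords2VecMN functions defined in the full listing below (the toy vocabulary and document are made up for illustration):

```python
vocab = ['stupid', 'dog', 'my', 'garbage']  # toy vocabulary
doc = ['stupid', 'garbage', 'stupid']       # 'stupid' occurs twice

print(setOfWords2Vec(vocab, doc))    # set-of-words model:  [1, 0, 0, 1] - presence/absence only
print(bagOfWords2VecMN(vocab, doc))  # bag-of-words model:  [2, 0, 0, 1] - occurrence counts
```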
5. Complete sample code
```python
import numpy as np


# Define the sample data set
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 stands for insulting speech, 0 stands for normal speech
    return postingList, classVec


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])  # Create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Union of the two sets removes duplicates; the order may vary between runs
    return list(vocabSet)


# Set-of-words model: output a document vector
def setOfWords2Vec(vocabList, inputSet):  # The parameters are the vocabulary list and a document
    returnVec = [0] * len(vocabList)  # Create a vector of zeros
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # If a vocabulary word appears in the document, set its position to 1
        else:
            print("The word: %s is not in my Vocabulary!" % word)
    return returnVec  # Positions of words that appear are 1, the rest are 0


# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # Increment the count instead of setting it to 1
    return returnVec
    # bagOfWords2VecMN is almost identical to setOfWords2Vec; the difference is that every time a
    # word is encountered its count is incremented rather than set to 1


# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):  # trainMatrix is the document matrix, trainCategory the vector of labels
    numTrainDocs = len(trainMatrix)   # Number of documents
    numWords = len(trainMatrix[0])    # Number of words in the vocabulary
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)  # Probability that a document is abusive, i.e. p(1)
    # Initialize counts to 1 and denominators to 2 (instead of zeros) to avoid zero probabilities
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]            # Add this document's word vector to the class-1 counts
            p1Denom += np.sum(trainMatrix[i])  # Add the total number of words in this document
        else:
            p0Num += trainMatrix[i]            # Add this document's word vector to the class-0 counts
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # Log of the per-word probabilities p(w|1); the log avoids underflow
    p0Vect = np.log(p0Num / p0Denom)  # Log of the per-word probabilities p(w|0)
    return p0Vect, p1Vect, pAbusive


# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)        # Element-wise product with the class-1 log probabilities, summed
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # Same for class 0
    if p1 > p0:
        return 1
    else:
        return 0  # Return whichever class has the higher (log) probability


def testingNB():
    listOPosts, listClasses = loadDataSet()    # Load the data
    myVocabList = createVocabList(listOPosts)  # Create the vocabulary list (no duplicates)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  # Convert each post to a 0/1 document vector
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))  # Train the classifier
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))


testingNB()
```
Example: filtering spam using naive Bayes
Example: e-mail classification using naive Bayes
- Collect data: provide text files
- Prepare data: parse the text file into an entry vector
- Analyze data: check entries to ensure the correctness of parsing
- Training algorithm: use the trainNB0 function established before
- Test algorithm: use classifyNB and build a new test function to calculate the error rate of the document
- Using the algorithm: build a complete program that classifies a group of documents and prints the misclassified documents to the screen
1. Preparing data: slicing text
Use Python's string split() method; the full listing below uses re.split() so that punctuation is stripped as well.
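A minimal sketch of that tokenization (the sample sentence is made up for illustration):

```python
import re

text = 'This book is the BEST book on Python, or M.L., I have ever laid eyes upon.'
tokens = re.split(r'\W+', text)                          # split on runs of non-alphanumeric characters
words = [tok.lower() for tok in tokens if len(tok) > 2]  # drop tokens of one or two characters, lower-case the rest
print(words)
```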
2. Test algorithm: cross validation using naive Bayes
The process of randomly selecting part of the data as the training set and using the remainder as the test set is called hold-out cross-validation.
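A minimal sketch of the hold-out split that spamTest performs below: 50 document indices, 10 of which are moved at random into the test set.

```python
import numpy as np

trainingSet = list(range(50))  # indices of all 50 documents
testSet = []
for i in range(10):
    randIndex = int(np.random.uniform(0, len(trainingSet)))  # random position among the remaining indices
    testSet.append(trainingSet[randIndex])                   # move that index to the test set
    del(trainingSet[randIndex])                              # remove it so it cannot be chosen again
print(len(trainingSet), len(testSet))  # 40 10
```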
3. Complete code example
```python
import re

import numpy as np


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Union of the two sets removes duplicates
    return list(vocabSet)


# Set-of-words model: output a document vector
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # 1 if the vocabulary word appears in the document
        else:
            print("The word: %s is not in my Vocabulary!" % word)
    return returnVec


# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)        # log probability of class 1
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # log probability of class 0
    if p1 > p0:
        return 1
    else:
        return 0


# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)  # p(1), the prior probability of class 1
    p0Num = np.ones(numWords)   # Initialize counts to 1 and denominators to 2 to avoid zero probabilities
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # log of the per-word probabilities p(w|1)
    p0Vect = np.log(p0Num / p0Denom)  # log of the per-word probabilities p(w|0)
    return p0Vect, p1Vect, pAbusive


# Slice text into lower-case tokens
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)  # split on non-word characters (the Python 2 code used r'\W*', which behaves differently in Python 3)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # keep tokens longer than two characters, lower-cased


# Hold-out cross-validation on the spam data set
def spamTest():
    docList = []    # list of tokenized documents
    classList = []  # class labels: 1 for spam, 0 for ham
    fullText = []   # all tokens from all e-mails
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())  # read and tokenize one spam e-mail
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())   # read and tokenize one ham e-mail
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # build the vocabulary list
    trainingSet = list(range(50))         # in Python 3, range() must be turned into a list so items can be deleted
    testSet = []
    for i in range(10):
        randIndex = int(np.random.uniform(0, len(trainingSet)))  # random index into the remaining training set
        testSet.append(trainingSet[randIndex])                   # move it to the test set
        del(trainingSet[randIndex])                              # remove it so it cannot be selected again
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # document vector for each training document
        trainClasses.append(classList[docIndex])                       # matching class label
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the classifier
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1  # count misclassified test documents
    print("The error rate is:", float(errorCount) / len(testSet))


spamTest()
```
Example: using naive Bayesian classifier to obtain regional tendency from personal advertisement
Example: use naive Bayes to find words related to region
- Collect data: to collect content from RSS feeds, you need to build an interface for RSS feeds
- Prepare data: parse the text file into an entry vector
- Analyze data: check entries to ensure the correctness of parsing
- Training algorithm: use the trainNB0 function established before
- Test algorithm: observe the error rate to make sure the classifier works; the tokenizer can be modified to reduce the error rate and improve the results
- Using the algorithm: build a complete program that wraps everything together; given two RSS feeds, it displays the most commonly used words for each
1. Collecting data: importing RSS feeds
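A minimal sketch of importing a feed with feedparser; the URL is the one used in the full listing below and may no longer return entries:

```python
import feedparser

# Parse an RSS feed into a dictionary-like object
ny = feedparser.parse('https://newyork.craigslist.org/d/rooms-shares/search/roo?format=rss')
print(len(ny['entries']))               # number of entries the feed returned (0 if the feed is dead)
if ny['entries']:
    print(ny['entries'][0]['summary'])  # the entry text that localWords tokenizes below
```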
2. Analyze the data: display the words related to the region
3. Complete sample code display
```python
import operator
import re

import feedparser
import numpy as np


# Create a vocabulary list in which no word is repeated
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # increment the count instead of setting it to 1
    return returnVec


# Slice text into lower-case tokens
def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


# Naive Bayes classifier training function (same as in the spam example)
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = np.sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


# Naive Bayes classification function (same as in the spam example)
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


# Return the 30 most frequent words
def calcMostFreq(vocabList, fullText):  # vocabList: words without duplicates; fullText: all words including duplicates
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)  # count how often each vocabulary word occurs
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)  # sort by count, descending
    return sortedFreq[:30]  # the 30 most frequent words


# RSS feed classifier with high-frequency word removal
def localWords(feed1, feed0):
    docList = []    # list of tokenized entries
    classList = []  # class labels: 1 for entries from feed1, 0 for entries from feed0
    fullText = []   # all tokens from all entries
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])  # tokenize one entry from feed1
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])  # tokenize one entry from feed0
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)  # the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])  # remove them, much like a stop-word list
    trainingSet = list(range(2 * minLen))
    testSet = []
    for i in range(20):  # randomly move 20 documents into the test set, as in spamTest
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print("The error rate is:", float(errorCount) / len(testSet))
    return vocabList, p0V, p1V
    # localWords is essentially the same as spamTest; the difference is that spamTest reads
    # local files while localWords reads RSS feeds


# Display the most characteristic words of each region
def getTopWords(ny, sf):
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []
    topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))  # words whose log probability for class 0 exceeds -6.0
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))  # words whose log probability for class 1 exceeds -6.0
    # Sort by log probability, descending; see
    # https://blog.csdn.net/weixin_52626164/article/details/116676414 for sorted(..., key=lambda ...) usage
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("*SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF*")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("*NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY*")
    for item in sortedNY:
        print(item[0])


ny = feedparser.parse('https://newyork.craigslist.org/d/rooms-shares/search/roo?format=rss')
sf = feedparser.parse('https://sfbay.craigslist.org/d/rooms-shares/search/roo?format=rss')
# Note: these feeds no longer return data, so the two lines above may not run as written.
# The important thing is the idea of the algorithm; any two RSS feeds (or other text sources) can be substituted.
vocabList, pSF, pNY = localWords(ny, sf)
getTopWords(ny, sf)
```
Summary
For classification, using probabilities is sometimes more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate an unknown probability from known values.
The amount of data needed can be reduced by assuming conditional independence between features. The independence assumption means that the probability of one word does not depend on the other words in the document. Although the independence assumption is not strictly correct, naive Bayes is still an effective classifier. Practical considerations: underflow can be handled by taking the logarithm of the probabilities, and the bag-of-words model works better than the set-of-words model for document classification.