This article is shared from the Huawei Cloud community post "[Python Artificial Intelligence] 23. Sentiment Classification Based on Machine Learning and TF-IDF (with Detailed NLP Data Cleaning)" by eastmount.
Data analysis and data mining projects usually go through preliminary preparation, data crawling, data preprocessing, data analysis, data visualization, and evaluation. The work before the analysis itself takes up nearly half of a data engineer's time, and the quality of preprocessing directly affects the quality of the subsequent models. The figure shows the basic steps of data preprocessing: Chinese word segmentation, part-of-speech tagging, data cleaning, feature extraction (vector space model storage), weight calculation (TF-IDF), and so on.
I Chinese word segmentation
After crawling a Chinese data set with Python, the reader first needs to segment it into words. English words are delimited by spaces, so phrases can be split directly on whitespace and no extra segmentation is needed. Chinese characters, however, are written without delimiters and carry meaning in context, with no obvious boundary between words. Chinese word segmentation technology is therefore needed to split the sentences in the corpus into word sequences separated by spaces. This section introduces Chinese word segmentation and the Jieba segmentation tool in detail.
Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words or word strings. It inserts separation marks, usually spaces, into Chinese strings that have no explicit word boundaries. As a simple example, consider segmenting the sentence "I am a programmer".
Input: I am a programmer
Output 1 (character-by-character split): I \ am \ a \ pro \ gram \ mer
Output 2 (overlapping two-character fragments): I am \ am a \ a pro \ pro gram \ gram mer
Output 3 (correct word segmentation): I \ am \ a \ programmer
The example code below first imports the Jieba package and then calls its functions to perform Chinese word segmentation.
#encoding=utf-8
import jieba

text = "Beijing Polytechnic University students come to apply"

data = jieba.cut(text, cut_all=True)    # Full mode
print("[Full mode]: ", " ".join(data))

data = jieba.cut(text, cut_all=False)   # Precise mode
print("[Precise mode]: ", " ".join(data))

data = jieba.cut(text)                  # The default is precise mode
print("[Default mode]: ", " ".join(data))

data = jieba.cut_for_search(text)       # Search engine mode
print("[Search engine mode]: ", " ".join(data))
The output of the code above includes the results of full mode, precise mode, and search engine mode.
II Data cleaning
When analyzing a corpus, there is usually dirty data or noisy phrases that interfere with the experimental results, so the corpus needs to be cleaned after word segmentation. For example, after segmenting text with Jieba, the result may still contain dirty data and stop words such as "we", "of", and "do". These words lower the data quality, so to get better analysis results the data set must be cleaned and filtered with a stop word list. Typical problems include:
- Incomplete data
- Duplicate data
- Error data
- Stop words
Here we focus on stop word filtering, which deletes words that occur frequently but do not affect the topic of the text. A stop word dictionary, stop_words.txt, is loaded alongside Jieba segmentation; if a segmented word appears in the dictionary, it is filtered out.
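A minimal sketch of this filtering step is shown below. It assumes a local stop_words.txt file with one stop word per line; the full preprocessing script follows later in this section.

# -*- coding: utf-8 -*-
# Minimal sketch of stop word filtering (assumes stop_words.txt exists, one word per line)
import jieba

with open('stop_words.txt', encoding='utf-8') as f:
    stop_list = set(line.strip() for line in f if line.strip())

def txt_cut(sentence):
    # Segment the sentence with Jieba and drop any token found in the stop word list
    return [w for w in jieba.lcut(sentence) if w not in stop_list]

print(txt_cut("Example sentence to be segmented and cleaned"))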
The data set below consists of reviews of the Huangguoshu Waterfall collected from sites such as Dianping and Meituan. We use the Jieba tool to segment the Chinese text.
- Positive reviews (Praise): 5000
- Negative reviews: 1000
Full code:
# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse

# Load the custom dictionary and the stop word dictionary
jieba.load_userdict("user_dict.txt")
stop_list = pd.read_csv('stop_words.txt',
                        engine='python',
                        encoding='utf-8',
                        delimiter="\n",
                        names=['t'])['t'].tolist()

# Chinese word segmentation function
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# Write the word segmentation results
fw = open('fenci_data.csv', "a+", newline='', encoding='gb18030')
writer = csv.writer(fw)
writer.writerow(['content', 'label'])

# Use csv.DictReader to read the information in the file
labels = []
contents = []
file = "data.csv"
with open(file, "r", encoding="UTF-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Get the data elements
        if row['label'] == 'Praise':
            res = 0
        else:
            res = 1
        labels.append(res)
        content = row['content']
        seglist = txt_cut(content)
        output = ' '.join(list(seglist))   # Join the tokens with spaces
        contents.append(output)

        # Write to file
        tlist = []
        tlist.append(output)
        tlist.append(res)
        writer.writerow(tlist)

print(labels[:5])
print(contents[:5])
fw.close()
The running results are shown in the figure below. On the one hand the script filters special punctuation and stop words; on the other hand the imported user_dict.txt dictionary keeps proper nouns such as "Huangguoshu Waterfall" and "scenic spot" as single tokens, which would otherwise be split into "Huangguoshu" and "waterfall", or "scenery" and "spot".
- Before data cleaning
I still remember when I was a child, I often stood in front of the TV, waiting for the broadcast of Journey to the West. "You carry the load, I lead the horse. Climb the mountain, wade in the water..." The familiar song sounded in our ears again. The water in the lyrics is the water of Guizhou; to be exact, it is the Huangguoshu Waterfall in Guizhou. That waterfall flows into our childhood and makes us linger. Huangguoshu Waterfall is not only a waterfall but a large scenic area, including the steep slope pond waterfall, the Tianxing Bridge scenic spot and the Huangguoshu Waterfall, among which the Huangguoshu great waterfall is the most famous.
- After data cleaning
I remember when I was a child, I waited in front of the TV for the broadcast of journey to the West. I carried my horse across the mountain and waded across the water. I was familiar with the song. When the song sounded, the water level in the lyrics did say that the curtain of Huangguoshu waterfall in Guizhou flowed into my childhood. Huangguoshu Waterfall Scenic Spot includes steep slope pond waterfall Tianxing bridge scenic spot Huangguoshu waterfall Huangguoshu waterfall name
III Feature extraction and TF-IDF calculation
1. Basic concepts
Weight calculation measures the importance of feature items in the document representation by assigning each feature word a weight that quantifies its contribution to the text. TF-IDF (term frequency–inverse document frequency) is a classical weighting technique widely used in data analysis and information processing. It computes the importance of a feature word in the whole corpus from the number of times the word appears in a text and its document frequency across the corpus. Its advantage is that it filters out words that are common but carry little meaning while keeping as many influential feature words as possible.
The TF-IDF value is the product of the term frequency TF and the inverse document frequency IDF. The weight grows with the frequency of the feature item in a document and shrinks with the number of documents in the corpus that contain it. The larger the TF-IDF value, the more important the feature word is to the text.
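The formula figure from the original post is not reproduced here; in standard notation, the TF-IDF weight of feature word $t_i$ in document $d_j$ is

$$\text{tf-idf}_{i,j} = \text{tf}_{i,j} \times \text{idf}_{i}$$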
The term frequency TF is computed as follows: $n_{i,j}$ is the number of times the feature word $t_i$ appears in the training text $D_j$, and the denominator is the total number of feature words in text $D_j$. The result is the word frequency of that feature word.
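In standard notation the term frequency is

$$\text{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$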
Inverse Document Frequency (IDF) is a classical weighting method for words and documents proposed by Karen Sparck Jones in 1972. In the formula, $|D|$ is the total number of texts in the corpus, and $|D_t|$ is the number of texts that contain the feature word $t_j$.
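In standard notation, and consistent with the log 1 = 0 remark below, the inverse document frequency is

$$\text{idf}_{i} = \log \frac{|D|}{|D_{t_i}|}$$

(some implementations add 1 to the denominator to avoid division by zero).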
In the inverse document frequency, the weight varies inversely with the number of documents that contain the feature word. Common words such as "we", "but" and "of" appear in almost every document, so their IDF value is very low; if a word appears in every document, log 1 = 0 and its weight vanishes, which suppresses these common words. Conversely, if a word such as "artificial intelligence" appears in only a few documents, its IDF value, and hence its weight, is high.
The core idea of TF-IDF is that if a feature word appears frequently in one article but rarely in others, it has good discriminative power and is suitable for weighting. The TF-IDF algorithm is simple and fast, and its results match intuition well, making it a common tool in text mining, sentiment analysis, topic modeling, and related fields.
2. Code implementation
We mainly use two classes from scikit-learn, CountVectorizer and TfidfTransformer, to compute the word frequencies and the TF-IDF values.
- CountVectorizer
This class converts the words in the texts into a word frequency matrix. For example, the text "I am a teacher" contains four words, and each of "I", "am", "a" and "teacher" appears once, so each has a frequency of 1. For M text documents and N distinct words, CountVectorizer generates a matrix a[M][N] where a[i][j] is the frequency of word j in text i. The fit_transform() function counts the occurrences of each word, and get_feature_names() returns all the keywords in the vocabulary.
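A minimal sketch using the toy corpus above is shown next. Note two assumptions: the custom token_pattern is only there because scikit-learn's default tokenizer drops single-character English tokens such as "I" and "a", and get_feature_names_out() is the newer spelling of the get_feature_names() call used in the article.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am a teacher", "I am a student"]

# Keep single-character tokens for this toy example (the default pattern would drop "I" and "a")
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)        # sparse word frequency matrix a[M][N]

print(vectorizer.get_feature_names_out())   # vocabulary (get_feature_names() in older scikit-learn)
print(X.toarray())                          # each row is one text, each column one word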
- TfidfTransformer
After the word frequency matrix has been computed with CountVectorizer, the TfidfTransformer class computes the TF-IDF value of each word. The TF-IDF values are stored as a matrix in which each row represents one text and each column holds the weight of one feature. Once the TF-IDF matrix is obtained, it can be fed into various analysis algorithms, such as clustering, LDA topic modeling, or opinion analysis.
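Continuing the small sketch above, the transformer is applied to the word frequency matrix X produced by CountVectorizer:

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)   # X is the word frequency matrix from the previous sketch

# Each row is one text, each column the TF-IDF weight of one feature word
print(tfidf.toarray())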
Full code:
# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#----------------------------------Step 1: read the file--------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0-positive 1-negative
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#----------------------------------Step 2: data preprocessing--------------------------------
# Convert the words in the texts into a word frequency matrix;
# element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer()

# This class computes the TF-IDF weight of each word
transformer = TfidfTransformer()

# The outer fit_transform computes TF-IDF; the inner fit_transform builds the word frequency matrix
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()   # use get_feature_names_out() in scikit-learn >= 1.0
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix; element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()   # sparse matrix, note the float32 dtype
print(X.shape)
print(X[:10])
The output results are as follows:
<class 'scipy.sparse.csr.csr_matrix'>
aaaaa
achievements
amazing
ananananan
ancient
anshun
aperture
app
Number of words: 20254
(6074, 20254)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
3. MemoryError (memory overflow) errors
When the data set is large, the dense matrix can no longer hold it in memory and errors such as the following occur:
- ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.
- MemoryError: Unable to allocate array with shape (26771, 69602) and data type float64
My solutions are as follows:
- Stop word filtering reduces unwanted feature words
- Use the sparse matrix support in the scipy package: coo_matrix(tfidf, dtype=np.float32) converts the TF-IDF matrix to a float32 sparse representation
- Increase the min_df parameter of CountVectorizer, e.g. CountVectorizer(min_df=5), to filter out feature words that occur rarely; this parameter can be tuned. Its counterpart max_df removes terms that appear too frequently, so-called corpus-specific stop words; the default max_df=1.0 ignores only terms that appear in 100% of the documents, while min_df=5 means ignoring terms that appear in fewer than 5 documents
- Use a GPU or expand the memory
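A hedged sketch of the first three points is shown below; the min_df and max_df values are illustrative rather than tuned, and contents is assumed to be the list of cleaned texts from the earlier script.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.sparse import coo_matrix
import numpy as np

# min_df drops rare terms, max_df drops corpus-specific stop words
vectorizer = CountVectorizer(min_df=5, max_df=0.9)
tfidf = TfidfTransformer().fit_transform(vectorizer.fit_transform(contents))

# Keep the matrix sparse (and in float32) as long as possible instead of calling toarray()
X = coo_matrix(tfidf, dtype=np.float32)
print(X.shape)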
IV Sentiment classification based on logistic regression
After obtaining the TF-IDF values of the texts, this section briefly explains the sentiment classification process, which mainly consists of the following steps:
- Generate the word frequency matrix for the segmented and cleaned corpus, mainly by calling the CountVectorizer class; the generated matrix is X.
- Call the TfidfTransformer class to compute the TF-IDF values of the word frequency matrix X and obtain the weight matrix.
- Call the scikit-learn machine learning package to perform the classification: call fit() for training and assign the predicted class labels to the pre array.
- Call the PCA() class of scikit-learn to reduce the features to two dimensions, corresponding to the X and Y axes, and visualize them (a minimal sketch of this step is given after the program output below).
- Algorithm optimization and algorithm evaluation.
Complete code of logistic regression:
# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#----------------------------------Step 1: read the file--------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0-positive 1-negative
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#----------------------------------Step 2: data preprocessing--------------------------------
# Convert the words in the texts into a word frequency matrix;
# element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)

# This class computes the TF-IDF weight of each word
transformer = TfidfTransformer()

# The outer fit_transform computes TF-IDF; the inner fit_transform builds the word frequency matrix
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()   # use get_feature_names_out() in scikit-learn >= 1.0
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix; element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()   # sparse matrix, note the float32 dtype
print(X.shape)
print(X[:10])

#----------------------------------Step 3: data split--------------------------------
# Use train_test_split to split the X and y lists
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

#--------------------------------Step 4: machine learning classification--------------------------------
# Logistic regression classification model
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Accuracy of model: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")
The operation results are shown in the figure below:
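The PCA visualization step listed earlier is not part of the program above; the following is a minimal sketch of it, assuming X_test and pre from the code above and using matplotlib (which the article does not import) for the plot. Colors and axis labels are illustrative.

# Minimal sketch of the PCA visualization step (assumes X_test and pre from the program above)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)            # reduce the TF-IDF features to two dimensions
newData = pca.fit_transform(X_test)

# Color each test sample by its predicted class (0 = positive, 1 = negative)
colors = ['green' if int(p) == 0 else 'red' for p in pre]
plt.scatter(newData[:, 0], newData[:, 1], c=colors, s=10)
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.show()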
V Algorithm performance evaluation
Much of the evaluation has to be implemented in our own code, such as drawing ROC curves, computing various statistics, and displaying detailed results. Here the author computes precision, recall, and F-measure with custom code; the formulas are as follows:
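The formula figure is not reproduced here; in standard notation, with TP, FP and FN the true positive, false positive and false negative counts for a class,

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$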
Since this article deals with a binary classification problem, the evaluation distinguishes only classes 0 and 1. The complete code is as follows:
# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#----------------------------------Step 1: read the file--------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0-positive 1-negative
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#----------------------------------Step 2: data preprocessing--------------------------------
# Convert the words in the texts into a word frequency matrix;
# element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)

# This class computes the TF-IDF weight of each word
transformer = TfidfTransformer()

# The outer fit_transform computes TF-IDF; the inner fit_transform builds the word frequency matrix
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()   # use get_feature_names_out() in scikit-learn >= 1.0
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix; element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()   # sparse matrix, note the float32 dtype
print(X.shape)
print(X[:10])

#----------------------------------Step 3: data split--------------------------------
# Use train_test_split to split the X and y lists
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

#--------------------------------Step 4: machine learning classification--------------------------------
# Logistic regression classification model
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Accuracy of model: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))

#----------------------------------Step 5: evaluation--------------------------------
def classification_pj(name, y_test, pre):
    print("Algorithm evaluation:", name)

    # Precision = number correctly identified / total number predicted as that class
    # Recall    = number correctly identified / total number of that class in the test set
    # F-measure = precision * recall * 2 / (precision + recall)

    YC_B, YC_G = 0, 0   # predicted bad / good
    ZQ_B, ZQ_G = 0, 0   # correctly predicted
    CZ_B, CZ_G = 0, 0   # actually present in the test set

    # 0-good 1-bad; both classes are counted to guard against label changes
    i = 0
    while i < len(pre):
        z = int(y_test[i])   # real label
        y = int(pre[i])      # predicted label
        if z == 0:
            CZ_G += 1
        else:
            CZ_B += 1
        if y == 0:
            YC_G += 1
        else:
            YC_B += 1
        if z == y and z == 0 and y == 0:
            ZQ_G += 1
        elif z == y and z == 1 and y == 1:
            ZQ_B += 1
        i = i + 1

    print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)
    print("")

    # Output the results
    P_G = ZQ_G * 1.0 / YC_G
    P_B = ZQ_B * 1.0 / YC_B
    print("Precision Good 0:", P_G)
    print("Precision Bad 1:", P_B)

    R_G = ZQ_G * 1.0 / CZ_G
    R_B = ZQ_B * 1.0 / CZ_B
    print("Recall Good 0:", R_G)
    print("Recall Bad 1:", R_B)

    F_G = 2 * P_G * R_G / (P_G + R_G)
    F_B = 2 * P_B * R_B / (P_B + R_B)
    print("F-measure Good 0:", F_G)
    print("F-measure Bad 1:", F_B)

# Function call
classification_pj("LogisticRegression", y_test, pre)
The output results are as follows:
Logistic regression classification
1823 1823
              precision    recall  f1-score   support

           0       0.94      0.99      0.97      1520
           1       0.93      0.70      0.80       303

    accuracy                           0.94      1823
   macro avg       0.94      0.85      0.88      1823
weighted avg       0.94      0.94      0.94      1823

Algorithm evaluation: LogisticRegression
213 1504 229 1594 303 1520

Precision Good 0: 0.9435382685069009
Precision Bad 1: 0.9301310043668122
Recall Good 0: 0.9894736842105263
Recall Bad 1: 0.7029702970297029
F-measure Good 0: 0.9659601798330122
F-measure Bad 1: 0.800751879699248
VI Algorithm comparison experiment
1.RandomForest
The code is as follows:
# Random forest classification model; n_estimators: number of trees in the forest
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)
print('Accuracy of model: {}'.format(clf.score(X_test, y_test)))
print("\n")
pre = clf.predict(X_test)
print('Prediction results:', pre[:10])
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("RandomForest", y_test, pre)
print("\n")
Output results:
2.SVM
The code is as follows:
# SVM classification model: LinearSVC support vector classifier
SVM = svm.LinearSVC()
SVM.fit(X_train, y_train)
print('Accuracy of model: {}'.format(SVM.score(X_test, y_test)))
pre = SVM.predict(X_test)
print("Support vector machine classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("LinearSVC", y_test, pre)
print("\n")
Output results:
3. Naive Bayes
The code is as follows:
# Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Accuracy of model: {}'.format(nb.score(X_test, y_test)))
pre = nb.predict(X_test)
print("Naive Bayes classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MultinomialNB", y_test, pre)
print("\n")
Output results:
4.KNN
The accuracy of this algorithm is not high and it runs slowly, so it is not recommended for text analysis, though it can still serve as a baseline for comparison. The core code is as follows:
# K-nearest neighbors algorithm
knn = neighbors.KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print('Accuracy of model: {}'.format(knn.score(X_test, y_test)))
pre = knn.predict(X_test)
print("Nearest neighbor classification")
print(classification_report(y_test, pre))
classification_pj("KNeighbors", y_test, pre)
print("\n")
Output results:
5. Decision tree
The code is as follows:
# Decision tree algorithm
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print('Accuracy of model: {}'.format(dtc.score(X_test, y_test)))
pre = dtc.predict(X_test)
print("Decision tree classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("DecisionTreeClassifier", y_test, pre)
print("\n")
Output results:
6.SGD
The code is as follows:
# SGD classification model
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(X_train, y_train)
print('Accuracy of model: {}'.format(sgd.score(X_test, y_test)))
pre = sgd.predict(X_test)
print("SGD classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("SGDClassifier", y_test, pre)
print("\n")
Output results:
7.MLP
The algorithm is slow, and the core code is as follows:
# MLP classification model
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier()
mlp.fit(X_train, y_train)
print('Accuracy of model: {}'.format(mlp.score(X_test, y_test)))
pre = mlp.predict(X_test)
print("MLP classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MLPClassifier", y_test, pre)
print("\n")
Output results:
8.GradientBoosting
The algorithm is slow, and the code is as follows:
# GradientBoosting classification model
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
print('Accuracy of model: {}'.format(gb.score(X_test, y_test)))
pre = gb.predict(X_test)
print("GradientBoosting classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("GradientBoostingClassifier", y_test, pre)
print("\n")
Output results:
9.AdaBoost
The code is as follows:
# AdaBoost classification model
from sklearn.ensemble import AdaBoostClassifier

AdaBoost = AdaBoostClassifier()
AdaBoost.fit(X_train, y_train)
print('Accuracy of model: {}'.format(AdaBoost.score(X_test, y_test)))
pre = AdaBoost.predict(X_test)
print("AdaBoost classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("AdaBoostClassifier", y_test, pre)
print("\n")
Output results: