NLP text classification: a detailed hands-on introductory tutorial

Posted by adamwhiles on Fri, 28 Jan 2022 02:58:21 +0100

Contents

Preface

I. Data loading

1. Load packages

2. Read data

II. Text processing

1. Remove useless characters

2. Text segmentation

3. Remove stop words

4. Remove low-frequency words

5. Split into training and test sets

III. Convert text into vectors

1. Convert text into TF-IDF vectors

2. Convert text into word2vec vectors

3. Convert text into BERT vectors

IV. Model training and evaluation

1. Training with TF-IDF vectors

2. Training with word2vec vectors

3. Training with BERT vectors

Summary

Preface

The practical task is to predict ratings on Douban: given a movie comment as input, the model outputs a rating, so this is essentially a text classification task. In this project we will:

  • Preprocess the text, e.g. stop-word filtering, low-frequency-word filtering, and special-symbol filtering
  • Convert the text into vectors in three ways: TF-IDF, word2vec and BERT
  • Train a logistic regression model with cross validation
  • Evaluate the accuracy of the model

I. Data loading

1. Load packages

First, load the required libraries. What each of them is used for is explained where it appears below.

# Import the basic data-processing packages
import numpy as np
import pandas as pd

# Import the Counter class for word counting
from collections import Counter

# Import the TF-IDF related classes
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Import the module for model evaluation
from sklearn import metrics

# Import the word2vec related class
from gensim.models import KeyedVectors

# Import the bert embedding related packages (see the experiment manual for notes on installing mxnet)
from bert_embedding import BertEmbedding
import mxnet

# tqdm draws a progress bar while iterating, so the program's progress can be monitored
from tqdm import tqdm

# Import a few other utility packages
import requests
import os

2. Read data

Next, read the data with pd.read_csv, which loads a csv file and returns the table as a DataFrame. Since we only need the comment and rating columns, we select them by column index.

# Read the data
data = pd.read_csv('data/DMSC.csv')
# Inspect the data format
data.head()
# Print some basic information about the data
data.info()
# Keep only the two columns we need: Comment and Star
data = data[['Comment','Star']]
# Inspect the format of the new data
data.head()

Output result:

data.head() with all of the original columns:

  | ID | Movie_Name_EN | Movie_Name_CN | Crawl_Date | Number | Username | Date | Star | Comment | Like
0 | 0  | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 1  | Ran pan | 2015-05-13 | 3 | Even aochuang knows that cosmetic surgery is going to South Korea. | 2404
1 | 10 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 11 | Shadow Chronicle | 2015-04-30 | 4 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel… | 381
2 | 20 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 21 | Flu at any time | 2015-04-28 | 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 120
3 | 30 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 31 | Crow fire hall | 2015-05-08 | 4 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand… | 30
4 | 40 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 41 | Office sweetheart | 2015-05-10 | 5 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two… | 16

data.head() after keeping only the Comment and Star columns:

  | Comment | Star
0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3
1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel… | 4
2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2
3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand… | 4
4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two… | 5

II. Text processing

1. Remove useless characters

Regular expressions are used to remove letters and digits, emoticons and other symbols, residual colons and punctuation, and whitespace.

# TODO1: remove useless characters — decide on a character set and strip it from the text
#    your to do
# Remove letters, digits, emoticons and other symbols
import re

def clear_character(sentence):
    pattern1 = '[a-zA-Z0-9]'
    pattern2 = re.compile(u'[^\s1234567890:：' + '\u4e00-\u9fa5]+')
    pattern3 = r'[!"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~]+'
    line1 = re.sub(pattern1, '', sentence)   # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)      # remove emoticons and other non-Chinese symbols
    line3 = re.sub(pattern3, '', line2)      # remove residual colons and other punctuation
    new_sentence = ''.join(line3.split())    # remove whitespace
    return new_sentence

data["comment_processed"] = data['Comment'].apply(clear_character)

data.head()

Output result:

  | Comment | Star | comment_processed
0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | Even aochuang knows that cosmetic surgery is going to South Korea
1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel… | 4 | A person without a dark side is not trustworthy. The second film strips off the lengthy bedding. From the beginning, that is, the climax to the end, some people will feel that there are only action stunts left…
2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | Aochuang weak explosion, weak explosion, weak explosion
3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand… | 4 | Different from the first episode, the connecting link is gloomy and serious, but it won't be bad, unless you don't like Marvel movies, the scenes are more grand, singles and group war…
4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two… | 5 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two travel insurance


2. Text segmentation

Use the jieba library to segment the cleaned text into words.
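
If you have not used jieba before, it may help to look at what it returns for a single sentence first; a minimal check (jieba.lcut(text) is simply shorthand for list(jieba.cut(text))):

import jieba

# Segment one short Chinese sentence; the tokens are returned as a list
print(jieba.lcut('这部电影的画面非常精彩'))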

# TODO2: import the Chinese word segmentation package jieba and use jieba to segment the original text
import jieba
def comment_cut(content):
    # TODO: use jieba to segment each comment
#     seg = jieba.lcut(content)
    seg = list(jieba.cut(content.strip()))
    return seg

# Output progress bar
tqdm.pandas(desc='apply')
data['comment_processed'] = data['comment_processed'].progress_apply(comment_cut)
# Observe the format of the new data
data.head()

Output result:

  | Comment | Star | comment_processed
0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | [Lian, aochuang, all know, cosmetic surgery, going to, Korea]
1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel… | 4 | [one, no, dark side, people, no, worthy, trusted, Part II, stripped, lengthy, and, …
2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | [altron, weak, explosive, weak, explosive, weak, explosive, weak, explosive, explosive, ah]
3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand… | 4 | [different from the first episode, connecting the preceding and the following, gloomy and serious, but also, not, no, good-looking, ah, …
4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two… | 5 | [after watching, I, excitedly, yes, friends, say, wait, aochuang, want, come, destroy, Taipei, …

3. Remove stop words

Load the stop words list and remove the stop words.

# TODO3: build the stop-word list and remove stop words from the text

# Download the Chinese stop-word list to data/stopWord.json, from: https://github.com/goto456/stopwords/
if not os.path.exists('data/stopWord.json'):
    stopWord = requests.get("https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt")
    with open("data/stopWord.json", "wb") as f:
         f.write(stopWord.content)

# Read the downloaded stop-word list into a set (a set makes the membership checks below fast)
with open("data/stopWord.json","r",encoding='utf-8') as f:
    stopWords = set(f.read().split("\n"))


# Remove stop words
def rm_stop_word(wordList):
    # your code: remove stop words
    # TODO
    filtered_words = [word for word in wordList if word not in stopWords]
    return filtered_words

# progress_apply() behaves like apply(), but lets the tqdm package display a progress bar while it runs.
data['comment_processed'] = data['comment_processed'].progress_apply(rm_stop_word)
# Observe the format of the new data
data.head()

Output result:

  | Comment | Star | comment_processed
0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | [aochuang, know, cosmetic surgery, Korea]
1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel… | 4 | [one, no, dark side, worthy of trust, Part II, stripping, lengthy, bedding, opening, climax, …
2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | [aochuang, weak, explosive, weak, explosive, weak, explosive]
3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand… | 4 | [Episode 1, different, connecting the preceding and the following, gloomy, serious, not good-looking, originally, like, marvel, film, …
4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two… | 5 | [after watching, excited, friend, say, aochuang, destruction, Taipei, thick, pat, shoulder, nothing, anyway, …

4. Remove low-frequency words

Loop over the comment column by its pandas index and merge all words into a single list. Then count the word frequencies with Counter and remove the words that occur fewer than 10 times.

# TODO4: remove low-frequency words — words with frequency below 10 — and store the result in data['comment_processed']

import jieba
import re
import pandas as pd
from collections import Counter

data.head()

list_set = []

for i in range(len(data)):
    # collect the words of each comment into one flat list
    list_set.extend(data.iloc[i]['comment_processed'])

words_count = Counter(list_set)

min_threshold=10
# words whose frequency is below the threshold; these are the ones to remove
my_dict = {k: v for k, v in words_count.items() if v < min_threshold}
filteredA = Counter(my_dict)

# Remove low-frequency words
def rm_low_frequence_word(wordList):
    # your code: remove low-frequency words
    # TODO
    filtered_words = [word for word in wordList if word not in filteredA]
    return filtered_words

# progress_apply() behaves like apply(), but lets the tqdm package display a progress bar while it runs.
data['comment_processed'] = data['comment_processed'].progress_apply(rm_low_frequence_word)
data.head()

5. Split into training and test sets

Take 20% of the corpus as test data and the rest as training data. comments_train holds the texts used for training, comments_test the texts used for testing; y_train and y_test are the corresponding labels (1, 2, 3, 4, 5).

from sklearn.model_selection import train_test_split
X = data['comment_processed']
y = data['Star']
test_ratio = 0.2
comments_train, comments_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, random_state=0)
print(comments_train.head(),y_train.head)

Output result:

104861 [no, read, novel, really, film, impressed, two major, female owner, beauty, acting, great, maybe
 190626 [Jigong, Lao, later, one, expectation, director, plot, picture, needless to say, finally, Hua Hua, Liu
 198677 [reputation, picture, can't find, praise, 43 years old, Xinhai, sincerity, should, make, change, inside, things]
207320 [rabbit, cute]
106219 [hope, in the end, I can live like...]
Name: comment_processed, dtype: object <bound method NDFrame.head of 
104861    5
190626    4
198677    3
207320    5
106219    4
         ..
176963    4
117952    2
173685    3
43567     5
199340    4
Name: Star, Length: 170004, dtype: int64>

III. Convert text into vectors

In this section, we will convert text into vectors in three different ways:

  • TF-IDF vectors
  • word2vec vectors
  • BERT vectors

After the text has been converted into vectors, the models are trained on each representation.

1. Convert text into TF-IDF vectors

The CountVectorizer and TfidfTransformer classes from sklearn.feature_extraction.text convert the training and test texts into TF-IDF vectors. Since CountVectorizer expects each document to be a single string of space-separated words rather than a list of tokens, the token lists are first joined with spaces.

from sklearn.feature_extraction.text import TfidfTransformer

comments_train1 = [' '.join(i) for i in comments_train]
comments_test1 = [' '.join(i) for i in comments_test]

print(comments_train[0:5])
print(comments_train1[0:5])

tfidf2 = TfidfTransformer()
counter = CountVectorizer(analyzer='word')
# counts = counter.fit_transform(comments_train1)
tfidf_train = tfidf2.fit_transform(counter.fit_transform(comments_train1))
tfidf_test=tfidf2.transform(counter.transform(comments_test1))
print(tfidf_train.shape,tfidf_test.shape)

Output result:

104861 [no, read, novel, really, film, impressed, two major, female owner, beauty, acting, great, maybe
 190626 [Jigong, Lao, later, one, expectation, director, plot, picture, needless to say, finally, Hua Hua, Liu
 198677 [reputation, picture, can't find, praise, 43 years old, Xinhai, sincerity, should, make, change, inside, things]
207320 [rabbit, cute]
106219 [hope, in the end, I can live like...]
Name: comment_processed, dtype: object
 ['I haven't read a novel, but I really admire the two women's great acting skills. Maybe a girl will be jealous and love friendship for a while', 'after Mr. Gong, I expected the director's plot picture. I don't have to say. Finally, I burst into tears and was a little embarrassed. After that, I finally realized that Japan's delicate style film industry occupies an important place. I knew someone respected Japan before The film now understands that no one can be better than the cure department at present ',' no praise can be found in the picture ',' 43 year old xinhaicheng should make something more inside ',' the rabbit is so cute ',' I hope I can live like that in the end ']
(170004, 87026) (42502, 87026)
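
The 87,026 columns are the vocabulary learned by CountVectorizer on the training comments. A quick way to confirm this, using the counter object fitted above (vocabulary_ is its fitted word-to-column mapping):

# The number of distinct words kept by CountVectorizer equals the number of TF-IDF columns
print(len(counter.vocabulary_))   # should match the second dimension printed above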

2. Convert text into word2vec vectors

Because training good word2vec vectors requires a very large corpus and a lot of computing power, we usually do not train them ourselves but use pre-trained vectors that are openly available online. sgns.zhihu.word is a pre-trained Chinese word-vector file downloaded from the Chinese-Word-Vectors project; it is loaded with the KeyedVectors.load_word2vec_format() function.

For each sentence, a sentence vector is built by averaging the vectors of all the words the sentence contains.

model = KeyedVectors.load_word2vec_format('data/sgns.zhihu.word')
model['今天']               # look up the vector of a single word ('today')
vocabulary = model.vocab    # note: in gensim 4+ this attribute is called model.key_to_index

vec_lem=model.vector_size   # dimension of the pre-trained vectors
def comm_vec(c):
    # sentence vector = average of the vectors of the words found in the vocabulary
    vec_com=np.zeros(vec_lem)
    coun=0
    for w in c:
        if w in model:
            vec_com+=model[w]
            coun+=1
    if coun==0:             # none of the words is in the vocabulary
        return vec_com
    return vec_com/coun

word2vec_train=np.vstack(comments_train.progress_apply(comm_vec))
word2vec_test=np.vstack(comments_test.progress_apply(comm_vec))
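
As a quick sanity check that the vectors loaded correctly, you can query a few nearest neighbours; a small example, assuming the word 电影 ('movie') is in the pre-trained vocabulary:

# Nearest neighbours of 电影 in the pre-trained vector space (the exact neighbours depend on the vector file)
print(model.most_similar('电影', topn=5))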

3. Convert text into BERT vectors

transformers is a pretrained-model library from Hugging Face that makes it easy to obtain word vectors through its API.

Next we show how to use the transformers library to generate Chinese word vectors. The pretrained model used is bert-mini, a Chinese pretrained model; BertTokenizer and BertModel convert words into vectors.

The process_word() function then averages the token vectors returned by BERT into a single word vector, and the comm_vec() function averages the word vectors into a sentence vector.

from transformers import BertTokenizer,BertModel
import torch
import logging

# set cuda
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
print(use_cuda)
if use_cuda:
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
else:
    device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)


bert_model_dir='bert-mini'
tokenizer=BertTokenizer.from_pretrained(bert_model_dir)
Bertmodel=BertModel.from_pretrained(bert_model_dir)

word=['今天我']   # a short Chinese example phrase ('today I')
input_id=tokenizer(word,padding=True,truncation=True,max_length=10,return_tensors='pt')
result=Bertmodel(input_id['input_ids'])
# print(result)
vec_len=len(result[0][0][1])
# vec_len=result[0,1,0].shape[0]
print(vec_len)

def process_word(w):
    vec_com=np.zeros(vec_len)
    num=len(w)
    input_id=tokenizer(w,padding=True,truncation=True,max_length=10,return_tensors='pt')
    res=Bertmodel(input_id['input_ids'])
    k=len(res[0][0])
    for i in range(k):
#         print(res[0][0][i].detach().numpy())
        vec_com+=res[0][0][i].detach().numpy()
    return vec_com/k

def comm_vec(c):
    # sentence vector = average of the BERT word vectors of the words in the comment
    vec_com=np.zeros(vec_len)
    coun=0
    for w in c:
        vec_com+=process_word(w)
        coun+=1
    if coun==0:          # empty comment after preprocessing
        return vec_com
    return vec_com/coun

bert_train=np.vstack(comments_train.progress_apply(comm_vec))
bert_test=np.vstack(comments_test.progress_apply(comm_vec))

print (tfidf_train.shape, tfidf_test.shape)
print (word2vec_train.shape, word2vec_test.shape)
print (bert_train.shape, bert_test.shape)

Output result:

(170004, 87026) (42502, 87026)
(170004, 300) (42502, 300)
(170004, 256) (42502, 256)

IV. Model training and evaluation

To train logistic regression models on the three vector representations above, we need to:

  • Build the model
  • Train the model (with cross validation)
  • Report the best results

Import packages for logistic regression and cross validation.

# Import package for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

1. Training with TF-IDF vectors

clf=LogisticRegression()
param_grid = {
    'C': [0.01,0.1, 1.0, 2.0,10,100], 
    'penalty' : ['l1', 'l2']
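    # note: 'l1' requires a solver such as liblinear or saga; the default lbfgs solver only supports 'l2'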
}
grid_search = GridSearchCV(estimator=clf,
                           param_grid=param_grid, 
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)
grid_search.fit(tfidf_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

lr_best=LogisticRegression(penalty='l2',C=2)
lr_best.fit(tfidf_train, y_train)
tf_idf_y_pred=lr_best.predict(tfidf_test)

print('TF-IDF LR test accuracy %s' % metrics.accuracy_score(y_test, tf_idf_y_pred))
# F1 score of the logistic regression model on the test set
print('TF-IDF LR test F1_score %s' % metrics.f1_score(y_test, tf_idf_y_pred,average="macro"))

Output result:

{'C': 1.0, 'penalty': 'l2'}
0.4747594131314477
TF-IDF LR test accuracy 0.47851865794550846
TF-IDF LR test F1_score 0.43133900686271376

2. Training with word2vec vectors

clf=LogisticRegression()
param_grid = {
    'C': [0.01,0.1, 1.0, 2.0,10,100], 
    'penalty' : ['l1', 'l2'],
    'solver':['liblinear','lbfgs','sag','saga']
}
grid_search = GridSearchCV(estimator=clf,
                           param_grid=param_grid, 
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)
grid_search.fit(word2vec_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

lr_best=LogisticRegression(penalty='l1',C=100,solver='saga')
lr_best.fit(word2vec_train, y_train)
word2vec_y_pred=lr_best.predict(word2vec_test)


print('Word2vec LR test accuracy %s' % metrics.accuracy_score(y_test, word2vec_y_pred))
# F1 score of the logistic regression model on the test set
print('Word2vec LR test F1_score %s' % metrics.f1_score(y_test, word2vec_y_pred,average="macro"))

Output result:

{'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.4425013587835652
Word2vec LR test accuracy 0.4447555409157216
Word2vec LR test F1_score 0.37275840165350765

3. Training with BERT vectors

clf=LogisticRegression()
param_grid = {
    'C': [0.01,0.1, 1.0, 2.0,10,100], 
    'penalty' : ['l1', 'l2'],
    'solver':['liblinear','lbfgs','sag','saga']
}
grid_search = GridSearchCV(estimator=clf,
                           param_grid=param_grid, 
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)
grid_search.fit(bert_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)
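
The test-set accuracy and F1 score shown below come from refitting a model with the hyperparameters selected above, just as in the previous two subsections; a minimal sketch of that step:

# Refit with the selected hyperparameters and evaluate on the test set
lr_best=LogisticRegression(penalty='l2',C=2.0,solver='sag')
lr_best.fit(bert_train, y_train)
bert_y_pred=lr_best.predict(bert_test)

print('Bert LR test accuracy %s' % metrics.accuracy_score(y_test, bert_y_pred))
print('Bert LR test F1_score %s' % metrics.f1_score(y_test, bert_y_pred,average="macro"))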

Output result:

{'C': 2.0, 'penalty': 'l2', 'solver': 'sag'}
0.4072104199357458
Bert LR test accuracy 0.4072090725142346
Bert LR test F1_score 0.3471216422860073

Summary

Training on word vectors obtained in these different ways, the five-class accuracy is a little above 40% in every case, and the differences between the representations are small. This is probably because only a logistic regression model is used, so a better representation by itself does not improve the result much. The effect could therefore be improved in the following ways:

  • Better ways of fusing word vectors into sentence vectors
  • Handling the class imbalance (see the sketch after this list)
  • Re-training the word vector models on in-domain data
  • Improving the jieba word segmentation
  • Classifying with a deep neural network
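
For the class-imbalance point, a simple first step that stays within the current pipeline is to let logistic regression re-weight the classes; a minimal sketch using the TF-IDF features from above (class_weight='balanced' weights each class inversely to its frequency in the training data):

# Logistic regression with balanced class weights on the TF-IDF features
lr_balanced = LogisticRegression(penalty='l2', C=2, class_weight='balanced', max_iter=1000)
lr_balanced.fit(tfidf_train, y_train)
balanced_pred = lr_balanced.predict(tfidf_test)
print('Balanced TF-IDF LR test F1_score %s' % metrics.f1_score(y_test, balanced_pred, average="macro"))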

This concludes this detailed introductory tutorial on text classification. Writing it up was a lot of work, so likes and bookmarks are appreciated. If you need the code or the models, leave a comment or send me a private message.

References:

Greedy college natural language processing

Topics: Machine Learning, PyTorch, NLP, logistic regression