Contents

Preface
1, Data loading
    1. Load packages
    2. Read data
2, Text processing
    1. Remove useless characters
    2. Text segmentation
    3. Remove stop words
    4. Remove low-frequency words
    5. Split into training and test sets
3, Convert text into vector form
    1. Convert text into TF-IDF vectors
    2. Convert text into word2vec vectors
    3. Convert text into BERT vectors
4, Training model and evaluation
    1. Train with TF-IDF vectors
    2. Train with word2vec vectors
    3. Train with BERT vectors
Summary

Preface
The task in this project is to predict the rating a film receives on Douban from the text of a review: the input is a piece of text and the output is a score, which makes this a text classification task. The project covers the following steps (the whole pipeline is sketched in code right after this list):

- Text preprocessing, such as stop-word filtering, low-frequency-word filtering, and special-symbol filtering
- Converting text into vectors in three ways: TF-IDF, word2vec, and BERT
- Training a logistic regression model with cross validation
- Evaluating the accuracy of the model
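To make the workflow concrete, here is a minimal, self-contained sketch of the same pipeline on a toy English corpus. It is illustrative only; the texts and labels are made up, and the rest of the tutorial builds each step out properly on the Douban data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Toy corpus standing in for the Douban comments (illustrative labels)
texts = ["great movie", "terrible plot", "great acting",
         "terrible pacing", "great fun", "terrible ending"]
labels = [5, 1, 5, 1, 5, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# Vectorize, train logistic regression, evaluate
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)
print(metrics.accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```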
1, Data loading
1. Load packages

First, load the required libraries; what each one does is explained where it is used below.
```python
# Import basic data-processing packages
import numpy as np
import pandas as pd

# Import Counter for word counting
from collections import Counter

# Import TF-IDF related packages
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Import package for model evaluation
from sklearn import metrics

# Import word2vec related packages
from gensim.models import KeyedVectors

# Import BERT-embedding related packages; see the experiment manual
# for notes on installing the mxnet package
from bert_embedding import BertEmbedding
import mxnet

# tqdm draws a progress bar while iterating, to monitor the program's progress
from tqdm import tqdm

# Import some other utility packages
import requests
import os
```
2. Read data
Next, read the data with the pd.read_csv() function, which reads a CSV file and returns the table as a DataFrame. Since we only need the comment and rating columns, we select them by column index.
```python
# Read the data
data = pd.read_csv('data/DMSC.csv')
# Inspect the data format
data.head()
# Print some information about the data
data.info()
# Keep only the two columns we need: Comment and Star
data = data[['Comment', 'Star']]
# Inspect the format of the new data
data.head()
```
Output result:
| | ID | Movie_Name_EN | Movie_Name_CN | Crawl_Date | Number | Username | Date | Star | Comment | Like |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 1 | Ran pan | 2015-05-13 | 3 | Even aochuang knows that cosmetic surgery is going to South Korea. | 2404 |
| 1 | 10 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 11 | Shadow Chronicle | 2015-04-30 | 4 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel... | 381 |
| 2 | 20 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 21 | Flu at any time | 2015-04-28 | 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 120 |
| 3 | 30 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 31 | Crow fire hall | 2015-05-08 | 4 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand... | 30 |
| 4 | 40 | Avengers Age of Ultron | Avengers 2 | 2017-01-22 | 41 | Office sweetheart | 2015-05-10 | 5 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two... | 16 |
| | Comment | Star |
|---|---|---|
| 0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 |
| 1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel... | 4 |
| 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 |
| 3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand... | 4 |
| 4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two... | 5 |
2, Text processing
1. Remove useless characters
Regular expressions are used to strip out unwanted characters: English letters and digits, emoticons and other special symbols, leftover punctuation such as stray colons, and whitespace.
```python
# TODO1: remove useless characters: define a character set and strip it from the text
import re

def clear_character(sentence):
    pattern1 = '[a-zA-Z0-9]'
    # Keep whitespace, digits, colons and Chinese characters; everything else
    # (emoticons and other special symbols) is dropped
    pattern2 = re.compile(u'[^\s1234567890:: ' + '\u4e00-\u9fa5]+')
    # Leftover punctuation
    pattern3 = r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~]+'
    line1 = re.sub(pattern1, '', sentence)  # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)     # remove emoticons and other characters
    line3 = re.sub(pattern3, '', line2)     # remove leftover colons and other symbols
    new_sentence = ''.join(line3.split())   # remove whitespace
    return new_sentence

data["comment_processed"] = data['Comment'].apply(clear_character)
data.head()
```
Output result:
| | Comment | Star | comment_processed |
|---|---|---|---|
| 0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | Even aochuang knows that cosmetic surgery is going to South Korea |
| 1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel... | 4 | A person without a dark side is not trustworthy. The second film strips off the lengthy bedding. From the beginning, that is, the climax to the end, some people will feel that there are only action stunts left |
| 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | Aochuang weak explosion, weak explosion, weak explosion |
| 3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand... | 4 | Different from the first episode, the connecting link is gloomy and serious, but it won't be bad, unless you don't like Marvel movies, the scenes are more grand, singles and group war |
| 4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two... | 5 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two travel insurance |
2. Text segmentation
Use the jieba package to segment each comment into words.

```python
# TODO2: import the Chinese word-segmentation package jieba and use it to segment the raw text
import jieba

def comment_cut(content):
    # TODO: use jieba to segment each comment
    # seg = jieba.lcut(content)
    seg = list(jieba.cut(content.strip()))
    return seg

# Show a progress bar while applying
tqdm.pandas(desc='apply')
data['comment_processed'] = data['comment_processed'].progress_apply(comment_cut)
# Inspect the format of the new data
data.head()
```
Output result:
| | Comment | Star | comment_processed |
|---|---|---|---|
| 0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | [Lian, aochuang, all know, cosmetic surgery, going to, Korea] |
| 1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel... | 4 | [one, no, dark side, people, no, worthy, trusted, Part II, stripped, lengthy, and...] |
| 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | [altron, weak, explosive, weak, explosive, weak, explosive, weak, explosive, explosive, ah] |
| 3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand... | 4 | [different from the first episode, connecting the preceding and the following, gloomy and serious, but also, not, no, good-looking, ah...] |
| 4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two... | 5 | [after watching, I, excitedly, yes, friends, say, wait, aochuang, want, come, destroy, Taipei...] |
3. Remove stop words
Load the stop-word list and remove stop words from the segmented comments.

```python
# TODO3: build the stop-word list and remove stop words from the text
# Download the Chinese stop-word list to data/stopWord.json,
# download address: https://github.com/goto456/stopwords/
if not os.path.exists('data/stopWord.json'):
    stopWord = requests.get("https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt")
    with open("data/stopWord.json", "wb") as f:
        f.write(stopWord.content)

# Read the downloaded stop-word list into a list
with open("data/stopWord.json", "r", encoding='utf-8') as f:
    stopWords = f.read().split("\n")

# Remove stop words
def rm_stop_word(wordList):
    # your code: remove stop words
    # TODO
    filtered_words = [word for word in wordList if word not in stopWords]
    return filtered_words

# progress_apply() is equivalent to apply(); writing it as progress_apply()
# lets the tqdm package show a progress bar while it runs.
data['comment_processed'] = data['comment_processed'].progress_apply(rm_stop_word)
# Inspect the format of the new data
data.head()
```
Output result:
| | Comment | Star | comment_processed |
|---|---|---|---|
| 0 | Even aochuang knows that cosmetic surgery is going to South Korea. | 3 | [aochuang, know, cosmetic surgery, Korea] |
| 1 | "A man without a dark side is not trustworthy." The second part peels off the lengthy bedding. The beginning is the climax, and until the end, someone will feel... | 4 | [one, no, dark side, worthy of trust, Part II, stripping, lengthy, bedding, opening, climax...] |
| 2 | Aochuang weak explosion, weak explosion, weak explosion!!!!!! | 2 | [aochuang, weak, explosive, weak, explosive, weak, explosive] |
| 3 | Different from the first episode, it is a connecting link between the preceding and the following, gloomy and serious, but it won't be bad, unless you don't like Marvel movies. The scene is more grand... | 4 | [Episode 1, different, connecting the preceding and the following, gloomy, serious, not good-looking, originally, like, marvel, film...] |
| 4 | After watching, I said excitedly to my friend, wait, what if aochuang is going to destroy Taipei? She patted me on the shoulder. It's okay. Anyway, you bought two... | 5 | [after watching, excited, friend, say, aochuang, destruction, Taipei, thick, pat, shoulder, nothing, anyway...] |
4. Remove low-frequency words
Loop over the comment column via the pandas index and merge all words into a single list; then count word frequencies with Counter and remove every word that occurs fewer than 10 times.

```python
# TODO4: remove low-frequency words (frequency < 10) and store the result
# back in data['comment_processed']
list_set = []
for i in range(len(data)):
    # Collect the words of each comment (extending word by word would
    # split the words into single characters)
    list_set.extend(data.iloc[i]['comment_processed'])

words_count = Counter(list_set)
min_threshold = 10
# Keep only the entries below the frequency threshold
my_dict = {k: v for k, v in words_count.items() if v < min_threshold}
filteredA = Counter(my_dict)

# Remove low-frequency words
def rm_low_frequence_word(wordList):
    # your code: remove low-frequency words
    # TODO
    filtered_words = [word for word in wordList if word not in filteredA]
    return filtered_words

# progress_apply() behaves like apply() but shows a tqdm progress bar.
data['comment_processed'] = data['comment_processed'].progress_apply(rm_low_frequence_word)
data.head()
```
5. Split into training and test sets

Hold out 20% of the corpus as test data and use the rest for training. comments_train (list) holds the training texts and comments_test (list) the test texts; y_train and y_test are the corresponding labels (1, 2, 3, 4, 5).
```python
from sklearn.model_selection import train_test_split

X = data['comment_processed']
y = data['Star']
test_ratio = 0.2
comments_train, comments_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, random_state=0)
# Note: y_train.head is missing its parentheses here, which is why the
# output below shows a bound method instead of the first five labels.
print(comments_train.head(), y_train.head)
```
Output result:
```
104861    [no, read, novel, really, film, impressed, two major, female owner, beauty, acting, great, maybe...
190626    [Jigong, Lao, later, one, expectation, director, plot, picture, needless to say, finally, Hua Hua, Liu...
198677    [reputation, picture, can't find, praise, 43 years old, Xinhai, sincerity, should, make, change, inside, things]
207320    [rabbit, cute]
106219    [hope, in the end, I can live like...]
Name: comment_processed, dtype: object
<bound method NDFrame.head of 104861    5
190626    4
198677    3
207320    5
106219    4
         ..
176963    4
117952    2
173685    3
43567     5
199340    4
Name: Star, Length: 170004, dtype: int64>
```
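Since the star ratings are not evenly distributed, an optional variant (not what produced the output above) is a stratified split, which preserves the label proportions in both subsets:

```python
# Optional: stratify on the label so each star rating appears with the
# same proportion in the training and test sets.
comments_train, comments_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, random_state=0, stratify=y)
```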
3, Convert text into vector form
In this section we convert the text into vectors in three different ways:

- TF-IDF vectors
- word2vec vectors
- BERT vectors

The resulting vectors are then used to train the models.
1. Convert text into TF-IDF vectors
The sklearn.feature_extraction.text module's CountVectorizer and TfidfTransformer convert the training and test texts into TF-IDF vectors. Since the vectorizer expects each document as a whitespace-separated string rather than a list of tokens, we first join each token list with spaces.
```python
from sklearn.feature_extraction.text import TfidfTransformer

# Join the token lists into whitespace-separated strings
comments_train1 = [' '.join(i) for i in comments_train]
comments_test1 = [' '.join(i) for i in comments_test]
print(comments_train[0:5])
print(comments_train1[0:5])

tfidf2 = TfidfTransformer()
counter = CountVectorizer(analyzer='word')
# counts = counter.fit_transform(comments_train1)
tfidf_train = tfidf2.fit_transform(counter.fit_transform(comments_train1))
tfidf_test = tfidf2.transform(counter.transform(comments_test1))
print(tfidf_train.shape, tfidf_test.shape)
```
Output result:
```
104861    [no, read, novel, really, film, impressed, two major, female owner, beauty, acting, great, maybe...
190626    [Jigong, Lao, later, one, expectation, director, plot, picture, needless to say, finally, Hua Hua, Liu...
198677    [reputation, picture, can't find, praise, 43 years old, Xinhai, sincerity, should, make, change, inside, things]
207320    [rabbit, cute]
106219    [hope, in the end, I can live like...]
Name: comment_processed, dtype: object
["I haven't read a novel, but I really admire the two women's great acting skills. Maybe a girl will be jealous and love friendship for a while", "after Mr. Gong, I expected the director's plot picture. I don't have to say. Finally, I burst into tears and was a little embarrassed. After that, I finally realized that Japan's delicate style film industry occupies an important place. I knew someone respected Japan before. The film now understands that no one can be better than the cure department at present", "no praise can be found in the picture 43 year old xinhaicheng should make something more inside", "the rabbit is so cute", "I hope I can live like that in the end"]
(170004, 87026) (42502, 87026)
```
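As an aside, sklearn's TfidfVectorizer combines CountVectorizer and TfidfTransformer into a single step; a drop-in equivalent of the code above would be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One-step equivalent of CountVectorizer + TfidfTransformer
tfidf_vec = TfidfVectorizer(analyzer='word')
tfidf_train_alt = tfidf_vec.fit_transform(comments_train1)
tfidf_test_alt = tfidf_vec.transform(comments_test1)
```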
2. Convert text into word2vec vectors
Because training a good word2vec model requires a very large corpus and substantial computing resources, we usually do not train the word vectors ourselves but instead use pre-trained vectors published online. sgns.zhihu.word is a pre-trained Chinese word-vector file downloaded from the Chinese-Word-Vectors project.

The pre-trained file is loaded with the KeyedVectors.load_word2vec_format() function.

For each sentence we then build a sentence vector by averaging the vectors of all the words the sentence contains.
```python
model = KeyedVectors.load_word2vec_format('data/sgns.zhihu.word')
model['今天']                      # look up the vector of an example word ("today")
vocabulary = model.vocab
vec_len = model['鲁迅'].shape[0]   # vector dimensionality ("Lu Xun")

def comm_vec(c):
    # Average the vectors of all in-vocabulary words in the comment
    vec_com = np.zeros(vec_len)
    coun = 0
    for w in c:
        if w in model:
            vec_com += model[w]
            coun += 1
    # Guard against comments with no in-vocabulary words
    return vec_com / coun if coun else vec_com

word2vec_train = np.vstack(comments_train.progress_apply(comm_vec))
word2vec_test = np.vstack(comments_test.progress_apply(comm_vec))
```
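Note that model.vocab only exists in gensim 3.x; gensim 4.x renamed the vocabulary mapping to key_to_index. A version-tolerant lookup:

```python
# gensim 4.x removed KeyedVectors.vocab in favour of key_to_index
try:
    vocabulary = model.key_to_index   # gensim >= 4.0
except AttributeError:
    vocabulary = model.vocab          # gensim 3.x
```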
3. Convert text into BERT vectors
transformers is a pre-trained model library provided by Hugging Face; its API makes it easy to obtain word vectors.

Next we show how to generate Chinese word vectors with the transformers library. The pre-trained model used is bert-mini, a Chinese text pre-training model; BertTokenizer and BertModel convert words into vectors.

The process_word() function then averages the token vectors of a word into a single word vector, and the comm_vec() function averages the word vectors into a sentence vector.
```python
from transformers import BertTokenizer, BertModel
import torch
import logging

# Set up CUDA
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
print(use_cuda)
if use_cuda:
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
else:
    device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)

bert_model_dir = 'bert-mini'
tokenizer = BertTokenizer.from_pretrained(bert_model_dir)
Bertmodel = BertModel.from_pretrained(bert_model_dir)

word = ['今天我']   # example input ("today I")
input_id = tokenizer(word, padding=True, truncation=True, max_length=10, return_tensors='pt')
result = Bertmodel(input_id['input_ids'])
# print(result)
vec_len = len(result[0][0][1])
# vec_len = result[0, 1, 0].shape[0]
print(vec_len)

def process_word(w):
    # Average the token vectors of one word into a single word vector
    vec_com = np.zeros(vec_len)
    input_id = tokenizer(w, padding=True, truncation=True, max_length=10, return_tensors='pt')
    res = Bertmodel(input_id['input_ids'])
    k = len(res[0][0])
    for i in range(k):
        vec_com += res[0][0][i].detach().numpy()
    return vec_com / k

def comm_vec(c):
    # Average the word vectors into a sentence vector; words are filtered
    # against the word2vec vocabulary loaded in the previous section
    vec_com = np.zeros(vec_len)
    coun = 0
    for w in c:
        if w in model:
            vec_com += process_word(w)
            coun += 1
    # Guard against comments with no in-vocabulary words
    return vec_com / coun if coun else vec_com

bert_train = np.vstack(comments_train.progress_apply(comm_vec))
bert_test = np.vstack(comments_test.progress_apply(comm_vec))

print(tfidf_train.shape, tfidf_test.shape)
print(word2vec_train.shape, word2vec_test.shape)
print(bert_train.shape, bert_test.shape)
```
Output result:
```
(170004, 87026) (42502, 87026)
(170004, 300) (42502, 300)
(170004, 256) (42502, 256)
```
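The comm_vec() above runs one BERT forward pass per word, which is slow. A common alternative (a sketch, assuming the same bert-mini checkpoint; not what produced the shapes above) encodes the whole comment in one pass and mean-pools the token vectors using the attention mask:

```python
def sentence_vec(words):
    # Re-join the segmented words and encode the full comment in a single
    # forward pass, then mean-pool over the real (non-padding) tokens.
    enc = tokenizer(''.join(words), truncation=True, max_length=128,
                    return_tensors='pt')
    with torch.no_grad():
        out = Bertmodel(**enc)
    hidden = out[0][0]                             # (seq_len, hidden_size)
    mask = enc['attention_mask'][0].unsqueeze(-1)  # (seq_len, 1)
    return ((hidden * mask).sum(0) / mask.sum()).numpy()

# bert_train = np.vstack(comments_train.progress_apply(sentence_vec))
```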
4, Training model and evaluation
We now train a logistic regression model on each of the three vector representations. For each one we:

- Build the model
- Train the model (with cross validation)
- Report the best results
Import packages for logistic regression and cross validation.
```python
# Import packages for logistic regression and grid-search cross validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
```
1. Train with TF-IDF vectors
```python
clf = LogisticRegression()
param_grid = {
    'C': [0.01, 0.1, 1.0, 2.0, 10, 100],
    'penalty': ['l1', 'l2'],
}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(tfidf_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

lr_best = LogisticRegression(penalty='l2', C=2)
lr_best.fit(tfidf_train, y_train)
tf_idf_y_pred = lr_best.predict(tfidf_test)
# Accuracy and macro-F1 of the logistic regression model on the test set
print('TF-IDF LR test accuracy %s' % metrics.accuracy_score(y_test, tf_idf_y_pred))
print('TF-IDF LR test F1_score %s' % metrics.f1_score(y_test, tf_idf_y_pred, average="macro"))
```
Output result:
```
{'C': 1.0, 'penalty': 'l2'}
0.4747594131314477
TF-IDF LR test accuracy 0.47851865794550846
TF-IDF LR test F1_score 0.43133900686271376
```
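To see which star ratings the model confuses (relevant given the imbalanced label distribution), sklearn's standard diagnostics can be printed for the predictions above:

```python
# Per-class precision/recall/F1 plus the confusion matrix show which star
# ratings are most often confused with their neighbours.
print(metrics.classification_report(y_test, tf_idf_y_pred))
print(metrics.confusion_matrix(y_test, tf_idf_y_pred))
```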
2. Train with word2vec vectors
```python
clf = LogisticRegression()
param_grid = {
    'C': [0.01, 0.1, 1.0, 2.0, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'lbfgs', 'sag', 'saga'],
}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(word2vec_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

lr_best = LogisticRegression(penalty='l1', C=100, solver='saga')
lr_best.fit(word2vec_train, y_train)
word2vec_y_pred = lr_best.predict(word2vec_test)
# Accuracy and macro-F1 of the logistic regression model on the test set
print('Word2vec LR test accuracy %s' % metrics.accuracy_score(y_test, word2vec_y_pred))
print('Word2vec LR test F1_score %s' % metrics.f1_score(y_test, word2vec_y_pred, average="macro"))
```
Output result:
```
{'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.4425013587835652
Word2vec LR test accuracy 0.4447555409157216
Word2vec LR test F1_score 0.37275840165350765
```
3. Train with BERT vectors
```python
clf = LogisticRegression()
param_grid = {
    'C': [0.01, 0.1, 1.0, 2.0, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'lbfgs', 'sag', 'saga'],
}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(bert_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)
```
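The snippet above only runs the grid search, yet the output below also reports test-set metrics, so an evaluation step has been lost. A reconstruction that mirrors the two previous sections, using the best parameters found by the grid search:

```python
# Reconstructed evaluation step (assumed, mirroring the TF-IDF and word2vec
# sections); the original snippet omitted it but the output reports it.
lr_best = LogisticRegression(penalty='l2', C=2.0, solver='sag')
lr_best.fit(bert_train, y_train)
bert_y_pred = lr_best.predict(bert_test)
print('Bert LR test accuracy %s' % metrics.accuracy_score(y_test, bert_y_pred))
print('Bert LR test F1_score %s' % metrics.f1_score(y_test, bert_y_pred, average="macro"))
```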
Output result:
```
{'C': 2.0, 'penalty': 'l2', 'solver': 'sag'}
0.4072104199357458
Bert LR test accuracy 0.4072090725142346
Bert LR test F1_score 0.3471216422860073
```
Summary

With word vectors obtained in these three different ways, the five-class accuracy is just above 40% in every case, and the gaps between methods are small. A likely reason is that only a logistic regression model was used, so the richer representations bring little extra benefit. The results could be improved along several lines (a sketch of the class-imbalance fix follows this list):

- Better ways of fusing word vectors into sentence vectors
- Addressing the class imbalance
- Retraining the word-vector models on in-domain data
- Improving the jieba segmentation
- Classifying with a deep neural network
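For the class-imbalance point, one low-effort option (a sketch, not something run in this tutorial) is sklearn's class_weight setting, reusing the TF-IDF features from above:

```python
# Reweight classes inversely to their frequency; this often improves
# macro-F1 on imbalanced star ratings, sometimes at a small cost in accuracy.
lr_balanced = LogisticRegression(penalty='l2', C=2, class_weight='balanced',
                                 max_iter=1000)
lr_balanced.fit(tfidf_train, y_train)
print(metrics.f1_score(y_test, lr_balanced.predict(tfidf_test), average="macro"))
```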
That concludes this very detailed introductory tutorial on text classification. Writing it all up was no small effort, so likes and bookmarks are much appreciated. If you need the code or the models, leave a comment or send a private message.
References:

Greedy Academy, natural language processing course