This article is shared from the Huawei Cloud community post "Keras Deep Learning Chinese Text Classification: A 10,000-Word Summary (CNN, TextCNN, BiLSTM, Attention)", by Eastmount.
I Text classification overview
Text classification aims to automatically classify and label a collection of texts according to a given classification system or standard; it is a form of automatic categorization based on that system. Text classification can be traced back to the 1950s, when it was mainly performed with expert-defined rules. The 1980s brought expert systems built through knowledge engineering, and in the 1990s text classification relied on machine learning, using manual feature engineering and shallow classification models. Today, word vectors and deep neural networks are the dominant approach.
Niu Yafeng summarized the traditional text classification process as shown in the figure below. In traditional text classification, most machine learning methods have been applied to the field. The main ones include:
- Naive Bayes
- KNN
- SVM
- Ensemble methods
- Maximum entropy
- Neural networks
The basic process of text classification using Keras framework is as follows:
- Step 1: text preprocessing: word segmentation -> stop-word removal -> count word frequencies and keep the top N words as feature words
- Step 2: assign an ID to each feature word
- Step 3: convert each text into a sequence of IDs and left-pad it to a fixed length
- Step 4: shuffle the training set
- Step 5: an Embedding layer maps word IDs to word vectors
- Step 6: stack the remaining layers to build the neural network structure
- Step 7: train the model
- Step 8: evaluate precision, recall, and F1-score
Note that if TF-IDF is used instead of word vectors for document representation, the TF-IDF matrix is built after word segmentation and stop-word removal and is then fed directly into the model, as sketched below.
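As a minimal sketch of that TF-IDF path (the LogisticRegression baseline and the vectorizer settings here are illustrative assumptions, not part of the article's pipeline; the file names assume the segmented CSVs produced in Section II):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

train_df = pd.read_csv("news_dataset_train_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")

# The cutword column already holds space-separated tokens, so the default tokenizer suffices
vec = TfidfVectorizer(max_features=6000)
X_train = vec.fit_transform(train_df.cutword.astype(str))
X_test = vec.transform(test_df.cutword.astype(str))

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_df.label)
print("test accuracy:", clf.score(X_test, test_df.label))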
Deep learning text classification methods include:
- Convolutional neural network (TextCNN)
- Recurrent neural network (TextRNN)
- TextRNN+Attention
- TextRCNN(TextRNN+CNN)
- BiLSTM+Attention
- Transfer learning
Recommended article by Niu Yafeng:
- Text classification based on word2vec and CNN: Review & Practice
II Data preprocessing and word segmentation
This article focuses mainly on the code; the algorithm principles are covered by the previous and subsequent articles in this series. The data set is shown in the following figure:
- Training set: news_dataset_train.csv, with game (10000), sports (10000), culture (10000) and finance (10000) samples
- Test set: news_dataset_test.csv, with game (5000), sports (5000), culture (5000) and finance (5000) samples
- Validation set: news_dataset_val.csv, with game (5000), sports (5000), culture (5000) and finance (5000) samples
First, we preprocess the Chinese text with word segmentation using the Jieba library. The code is as follows:
- data_preprocess.py
# -*- coding:utf-8 -*-
# By: Eastmount CSDN 2021-03-19
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse

# Load the custom user dictionary and the stop-word list
jieba.load_userdict("user_dict.txt")
stop_list = pd.read_csv('stop_words.txt',
                        engine='python',
                        encoding='utf-8',
                        delimiter="\n",
                        names=['t'])['t'].tolist()

#-----------------------------------------------------------------------
# Jieba word segmentation function
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

#-----------------------------------------------------------------------
# Read a file and write out the segmented Chinese text
def fenci(filename, result):
    # Open the output file for the segmentation results
    fw = open(result, "w", newline='', encoding='gb18030')
    writer = csv.writer(fw)
    writer.writerow(['label', 'cutword'])

    # Read the input file with csv.DictReader
    labels = []
    contents = []
    with open(filename, "r", encoding="UTF-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Fetch the fields
            labels.append(row['label'])
            content = row['content']
            # Chinese word segmentation
            seglist = txt_cut(content)
            # Join the words with spaces
            output = ' '.join(list(seglist))
            contents.append(output)
            # Write one row to the output file
            tlist = []
            tlist.append(row['label'])
            tlist.append(output)
            writer.writerow(tlist)
    print(labels[:5])
    print(contents[:5])
    fw.close()

#-----------------------------------------------------------------------
# Main function
if __name__ == '__main__':
    fenci("news_dataset_train.csv", "news_dataset_train_fc.csv")
    fenci("news_dataset_test.csv", "news_dataset_test_fc.csv")
    fenci("news_dataset_val.csv", "news_dataset_val_fc.csv")
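The script assumes two plain-text helper files next to it. As an illustration (these entries are hypothetical; use words from your own corpus), Jieba's user dictionary takes one word per line with an optional frequency and part-of-speech tag, and the stop-word list holds one word per line, matching the delimiter="\n" read above:

# user_dict.txt: one custom word per line, frequency and POS tag optional
王者荣耀 5 n
网络安全 5 n

# stop_words.txt: one stop word per line
的
了
和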
The results of running the script are shown in the figure below:
Next, we take a quick look at the length distribution of the segmented texts and visualize the labels.
- data_show.py
# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
"""
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

#---------------------------------------Step 1: read the data------------------------------------
## Read the segmented data sets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Configure matplotlib to display Chinese characters
plt.rcParams['font.sans-serif'] = ['KaiTi']   # specify a default Chinese font
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly in saved images

## See which labels are included in the training set
plt.figure()
sns.countplot(train_df.label)
plt.xlabel('Label', size=10)
plt.xticks(size=10)
plt.show()

## Analyze the distribution of text lengths (number of words) in the training set
# The segmented CSV has no 'cutwordnum' column, so compute it from 'cutword'
train_df['cutwordnum'] = train_df.cutword.apply(lambda x: len(str(x).split()))
print(train_df.cutwordnum.describe())
plt.figure()
plt.hist(train_df.cutwordnum, bins=100)
plt.xlabel("Phrase length", size=12)
plt.ylabel("Frequency", size=12)
plt.title("Training data set")
plt.show()
The output results are shown in the figure below. In the following article, we will introduce how to draw good-looking charts.
Note that if the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 17: invalid continuation byte" is reported, the CSV file needs to be re-saved in UTF-8 format, as shown in the following figure.
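If re-saving by hand is inconvenient, a small script can convert the file. A sketch, assuming the CSV was written in gb18030 as in the segmentation code above:

import pandas as pd

# Read with the encoding the file was actually saved in, then rewrite it as UTF-8
df = pd.read_csv("news_dataset_train_fc.csv", encoding="gb18030")
df.to_csv("news_dataset_train_fc.csv", index=False, encoding="utf-8")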
III CNN Chinese text classification
1. Principle introduction
A convolutional neural network (CNN) is a class of feedforward neural networks with a deep structure and convolution operations, and is one of the representative algorithms of deep learning. It is usually applied to image recognition and speech recognition, where it gives strong results, and it is also used in video analysis, machine translation, natural language processing, drug discovery and other fields. The famous AlphaGo, which enabled a computer to master Go, is also based on convolutional neural networks.
- Convolution does not process each pixel in isolation but operates on regions of the image. This strengthens the continuity of the image, letting the network see shapes rather than isolated points, and deepens its understanding of the image.
- A typical CNN goes through the pipeline "image -> convolution -> pooling -> convolution -> pooling -> two fully connected layers -> classifier" to finally carry out the classification; a small shape sketch for the text case follows.
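To make those shapes concrete for text, here is a minimal sketch using the same layer types as the code below, with the article's 600-word sequences and a 128-dimensional embedding (the sizes are illustrative):

from keras.layers import Input, Embedding, Convolution1D, MaxPool1D
from keras.models import Model

inp = Input(shape=[600])                              # a padded sequence of 600 word IDs
emb = Embedding(6001, 128)(inp)                       # -> (None, 600, 128)
conv = Convolution1D(128, 3, padding='same')(emb)     # -> (None, 600, 128): one feature vector per 3-word window
pool = MaxPool1D(pool_size=4)(conv)                   # -> (None, 150, 128): keep the max over every 4 positions
Model(inp, pool).summary()

Each Conv1D filter slides over word windows instead of pixel regions, and pooling condenses neighboring windows, mirroring the image pipeline above.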
2. Code implementation
The Keras CNN code for text classification is as follows:
- Keras_CNN_cnews.py
# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
CNN Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential

## If you are running on CPU, comment out this GPU configuration block
## per_process_gpu_memory_fraction=0.8 lets each process use up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
start = time.clock()

#----------------------------Step 1: read the data----------------------------
## Read the segmented data sets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")
print(train_df.head())

## Configure matplotlib to display Chinese characters
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#--------------------------Step 2: LabelEncoder + OneHotEncoder--------------------
## Encode the label column of each data set
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#-----------------------Step 3: encode the phrases with Tokenizer--------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # keep the 6000 most frequent words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast every entry to str in case some cut-word cells were parsed as numbers
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
## After the Tokenizer is created, fit_on_texts() builds the word index from the corpus
tok.fit_on_texts(train_content)
print(tok)
#tok.fit_on_texts(train_df.cutword)

## Save the fitted Tokenizer, then load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

## word_index maps each word to its integer code
## word_counts maps each word to its frequency
for ii, iterm in enumerate(tok.word_index.items()):
    if ii < 10:
        print(iterm)
    else:
        break
print("===================")
for ii, iterm in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(iterm)
    else:
        break

#---------------------------Step 4: convert the texts into padded sequences-----------------------------
## After encoding, every word in a news item is represented by its ID,
## so each news item becomes a vector of word IDs
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## sequence.pad_sequences() pads every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print("Sequences after conversion")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#-------------------------------Step 5: build the CNN model--------------------------
## 4 categories
num_labels = 4
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
## Embedding layer (trainable=False freezes it; using real pre-trained word
## vectors would additionally require a weights=[embedding_matrix] argument)
layer = Embedding(max_words + 1, 128, input_length=max_len, trainable=False)(inputs)
## Convolution and pooling layers (word window size 3, 128 filters)
cnn = Convolution1D(128, 3, padding='same', strides=1, activation='relu')(layer)
cnn = MaxPool1D(pool_size=4)(cnn)
## Dropout prevents overfitting
flat = Flatten()(cnn)
drop = Dropout(0.3)(flat)
## Fully connected output layer
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)

## Loss function, optimizer and evaluation metric
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

#-------------------------------Step 6: train and predict--------------------------
## Set flag to "train" for the first run, then to "test"
flag = "train"
if flag == "train":
    print("model training")
    ## Stop training early once val_loss improves by less than 0.0001
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
    ## Save the model
    model.save('my_model.h5')
    del model   # delete the in-memory model
    ## Elapsed time
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("model prediction")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions with a confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    ## Visualize the confusion matrix
    Labname = ["Sports", "Culture", "Finance and Economics", "Game"]
    print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(4) + 0.5, Labname, size=12)
    plt.yticks(np.arange(4) + 0.5, Labname, size=12)
    plt.savefig('result.png')
    plt.show()

#----------------------------------Step 7: validate the algorithm--------------------------
## Preprocess the validation set with tok and predict with the trained model
## (reload the saved model so this section also works right after training)
model = load_model('my_model.h5')
val_seq = tok.texts_to_sequences(val_df.cutword)
## Pad every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
## Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
## Elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
The GPU run is shown in the figure below. Note that if your machine only has a CPU, you just need to comment out the GPU configuration at the top of the code; the LSTM sections later in this article additionally use CuDNNLSTM, a GPU-only layer that must be swapped for the standard LSTM layer on CPU.
The training output model is shown in the figure below:
The training output results are as follows:
model training
Train on 40000 samples, validate on 20000 samples
Epoch 1/10
40000/40000 [==============================] - 15s 371us/step - loss: 1.1798 - acc: 0.4772 - val_loss: 0.9878 - val_acc: 0.5977
Epoch 2/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.8681 - acc: 0.6612 - val_loss: 0.8167 - val_acc: 0.6746
Epoch 3/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.7268 - acc: 0.7245 - val_loss: 0.7084 - val_acc: 0.7330
Epoch 4/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.6369 - acc: 0.7643 - val_loss: 0.6462 - val_acc: 0.7617
Epoch 5/10
40000/40000 [==============================] - 4s 96us/step - loss: 0.5670 - acc: 0.7957 - val_loss: 0.5895 - val_acc: 0.7867
Epoch 6/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.5074 - acc: 0.8226 - val_loss: 0.5530 - val_acc: 0.8018
Epoch 7/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.4638 - acc: 0.8388 - val_loss: 0.5105 - val_acc: 0.8185
Epoch 8/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.4241 - acc: 0.8545 - val_loss: 0.4836 - val_acc: 0.8304
Epoch 9/10
40000/40000 [==============================] - 4s 92us/step - loss: 0.3900 - acc: 0.8692 - val_loss: 0.4599 - val_acc: 0.8403
Epoch 10/10
40000/40000 [==============================] - 4s 93us/step - loss: 0.3657 - acc: 0.8761 - val_loss: 0.4472 - val_acc: 0.8457
Time used: 52.203992899999996
The prediction and verification results are as follows:
[[3928  472  264  336]
 [ 115 4529  121  235]
 [ 151  340 4279  230]
 [ 145  593  195 4067]]
             precision    recall  f1-score   support
          0       0.91      0.79      0.84      5000
          1       0.76      0.91      0.83      5000
          2       0.88      0.86      0.87      5000
          3       0.84      0.81      0.82      5000
avg / total       0.85      0.84      0.84     20000

             precision    recall  f1-score   support
          0       0.90      0.77      0.83      5000
          1       0.78      0.92      0.84      5000
          2       0.88      0.85      0.86      5000
          3       0.84      0.85      0.85      5000
avg / total       0.85      0.85      0.85     20000
IV TextCNN Chinese text classification
1. Principle introduction
TextCNN is an algorithm that applies convolutional neural networks to text classification. It was proposed by Yoon Kim in the 2014 paper "Convolutional Neural Networks for Sentence Classification".
The core idea of a convolutional neural network is to capture local features. For text, local features are sliding windows of several words, similar to N-grams. The strength of a CNN is that it can automatically combine and select N-gram features to obtain semantic information at different levels of abstraction. The figure below shows the convolutional neural network architecture used for text classification in that paper.
Another classic TextCNN paper is "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification". Its model structure is shown in the figure below; it describes the TextCNN architecture for text classification in detail, explaining how the word-vector matrix is convolved.
Suppose we have sentences to classify. Each word in a sentence is an n-dimensional word vector, so the input matrix has size m×n, where m is the sentence length. The CNN convolves this input, but for text the filter no longer slides horizontally: it only moves downward, which resembles the way N-grams extract local correlations between words.
The figure uses three filter window sizes, namely 2, 3 and 4, with two filters per size (in real training there are far more filters). Applying the different filters to their word windows yields six feature maps; each is then max-pooled to a single value, the pooled values are concatenated, and the result is the feature representation of the sentence. This sentence vector is handed to the classifier, completing the whole text classification process; a minimal sketch of this structure follows.
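As a minimal sketch of exactly that structure (note: following the paper, each feature map here is reduced by global max-pooling to a single value, whereas the full code in the next section uses local MaxPool1D windows; two filters per size are used only to mirror the figure):

from keras.layers import Input, Embedding, Convolution1D, GlobalMaxPooling1D, Dense
from keras.layers.merge import concatenate
from keras.models import Model

inp = Input(shape=[600])
emb = Embedding(6001, 128)(inp)
# One branch per window size (2, 3, 4) with two filters each, as in the figure
branches = [GlobalMaxPooling1D()(Convolution1D(2, size, activation='relu')(emb))
            for size in (2, 3, 4)]
sent_vec = concatenate(branches)               # 3 sizes x 2 filters = a 6-dimensional sentence vector
out = Dense(4, activation='softmax')(sent_vec)
Model(inp, out).summary()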
Finally, I sincerely recommend the following introductions to TextCNN, especially those by Asia Lee on CSDN. I like his articles very much; they are really excellent!
- Convolutional Neural Networks for Sentence Classification (2014)
- A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
- https://blog.csdn.net/asialee_bird/article/details/88813385
- https://zhuanlan.zhihu.com/p/77634533
2. Code implementation
The Keras TextCNN code for text classification is as follows:
- Keras_TextCNN_cnews.py
# -*- coding: utf-8 -*-
"""
Created on 2021-03-19
@author: xiuzhang Eastmount CSDN
TextCNN Model
"""
import os
import time
import pickle
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate

## If you are running on CPU, comment out this GPU configuration block
## per_process_gpu_memory_fraction=0.8 lets each process use up to 80% of GPU memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
start = time.clock()

#----------------------------Step 1: read the data----------------------------
## Read the segmented data sets
train_df = pd.read_csv("news_dataset_train_fc.csv")
val_df = pd.read_csv("news_dataset_val_fc.csv")
test_df = pd.read_csv("news_dataset_test_fc.csv")

## Configure matplotlib to display Chinese characters
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#--------------------------Step 2: LabelEncoder + OneHotEncoder--------------------
## Encode the label column of each data set
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
print("Label:")
print(train_y[:10])

le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))

## One-hot encode the integer labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#-----------------------Step 3: encode the phrases with Tokenizer--------------------
max_words = 6000
max_len = 600
tok = Tokenizer(num_words=max_words)   # keep the 6000 most frequent words
print(train_df.cutword[:5])
print(type(train_df.cutword))

## Cast every entry to str in case some cut-word cells were parsed as numbers
train_content = [str(a) for a in train_df.cutword.tolist()]
val_content = [str(a) for a in val_df.cutword.tolist()]
test_content = [str(a) for a in test_df.cutword.tolist()]
tok.fit_on_texts(train_content)
print(tok)

## Save the fitted Tokenizer, then load it back
with open('tok.pickle', 'wb') as handle:   # saving
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:   # loading
    tok = pickle.load(handle)

#---------------------------Step 4: convert the texts into padded sequences-----------------------------
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

## Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print("Sequences after conversion")
print(train_seq_mat.shape)
print(val_seq_mat.shape)
print(test_seq_mat.shape)
print(train_seq_mat[:2])

#-------------------------------Step 5: build the TextCNN model--------------------------
## 4 categories
num_labels = 4
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
## Embedding layer (trainable=False freezes it; using real pre-trained word
## vectors would additionally require a weights=[embedding_matrix] argument)
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=False)(inputs)
## Three branches with word window sizes 3, 4 and 5
cnn1 = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn1 = MaxPool1D(pool_size=4)(cnn1)
cnn2 = Convolution1D(256, 4, padding='same', strides=1, activation='relu')(layer)
cnn2 = MaxPool1D(pool_size=4)(cnn2)
cnn3 = Convolution1D(256, 5, padding='same', strides=1, activation='relu')(layer)
cnn3 = MaxPool1D(pool_size=4)(cnn3)
# Merge the output vectors of the three branches
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
flat = Flatten()(cnn)
drop = Dropout(0.2)(flat)
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

#-------------------------------Step 6: train and predict--------------------------
## Set flag to "train" for the first run, then to "test"
flag = "train"
if flag == "train":
    print("model training")
    ## Stop training early once val_loss improves by less than 0.0001
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])
    model.save('my_model.h5')
    del model
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("model prediction")
    ## Load the trained model
    model = load_model('my_model.h5')
    ## Predict on the test set
    test_pre = model.predict(test_seq_mat)
    ## Evaluate the predictions with a confusion matrix
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    ## Visualize the confusion matrix
    Labname = ["Sports", "Culture", "Finance and Economics", "Game"]
    print(metrics.classification_report(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1)))
    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True,
                fmt='d', cbar=False, linewidths=.6,
                cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(4) + 0.5, Labname, size=12)
    plt.yticks(np.arange(4) + 0.5, Labname, size=12)
    plt.savefig('result.png')
    plt.show()

#----------------------------------Step 7: validate the algorithm--------------------------
## Preprocess the validation set with tok and predict with the trained model
## (reload the saved model so this section also works right after training)
model = load_model('my_model.h5')
val_seq = tok.texts_to_sequences(val_df.cutword)
## Pad every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
## Predict on the validation set
val_pre = model.predict(val_seq_mat)
print(metrics.classification_report(np.argmax(val_y, axis=1), np.argmax(val_pre, axis=1)))
elapsed = (time.clock() - start)
print("Time used:", elapsed)
The model structure is as follows:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
inputs (InputLayer)             (None, 600)          0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 600, 256)     1536256     inputs[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 600, 256)     196864      embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 600, 256)     262400      embedding_1[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 600, 256)     327936      embedding_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 150, 256)     0           conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 150, 256)     0           conv1d_2[0][0]
__________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D)  (None, 150, 256)     0           conv1d_3[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 150, 768)     0           max_pooling1d_1[0][0]
                                                                 max_pooling1d_2[0][0]
                                                                 max_pooling1d_3[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 115200)       0           concatenate_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 115200)       0           flatten_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 4)            460804      dropout_1[0][0]
==================================================================================================
Total params: 2,784,260
Trainable params: 1,248,004
Non-trainable params: 1,536,256
__________________________________________________________________________________________________
The predicted results are as follows:
[[4448  238  182  132]
 [ 151 4572  124  153]
 [ 185  176 4545   94]
 [ 181  394  207 4218]]
             precision    recall  f1-score   support
          0       0.90      0.89      0.89      5000
          1       0.85      0.91      0.88      5000
          2       0.90      0.91      0.90      5000
          3       0.92      0.84      0.88      5000
avg / total       0.89      0.89      0.89     20000

             precision    recall  f1-score   support
          0       0.90      0.88      0.89      5000
          1       0.86      0.93      0.89      5000
          2       0.91      0.89      0.90      5000
          3       0.92      0.88      0.90      5000
avg / total       0.90      0.90      0.90     20000
V LSTM Chinese text classification
1. Principle introduction
Long Short-Term Memory (LSTM) networks are a special type of RNN (recurrent neural network) that can learn long-term dependencies. LSTM was proposed by Hochreiter & Schmidhuber (1997) and later refined and popularized by Alex Graves. It has achieved considerable success on many problems and is widely used.
Because ordinary RNNs suffer from the vanishing gradient problem, researchers made the hidden structure at each sequence position t more elaborate, using several techniques to avoid vanishing gradients. This special kind of RNN is the LSTM, whose full name is Long Short-Term Memory. Thanks to its design, LSTM is well suited to modeling sequential data such as text. Its structure is shown in the figure below:
LSTM is deliberately designed to avoid the long-term dependency problem: remembering information over long spans is its default behavior in practice, not an ability acquired at great cost. Compared with an ordinary RNN, an LSTM adds three controllers:
- Input controller
- Output controller
- Forget controller
Think of the cell state on the left as the main plot of a film: the original RNN flow becomes a subplot, and the three controllers sit on that subplot.
- Input controller (write gate): a gate at the input that decides whether the current input is written into Memory. It is effectively a trainable parameter that controls whether the current point is remembered.
- Output controller (read gate): a gate at the output that decides whether the current Memory is read out.
- Forget controller (forget gate): a gate that decides whether the previous Memory is forgotten.
LSTM works as follows. If the subplot content matters for the final result, the input controller writes it into the main line in proportion to its importance, to be analyzed later. If the subplot content overturns our previous understanding, the forget controller discards part of the main-line content and substitutes the new content proportionally; thus updating the main line depends on the input and forget controls. The final output is based on both the main-line and subplot content. Through these three gates we can control the RNN well, and with these control mechanisms LSTM is an effective remedy for long-delay memory, yielding better results; the standard equations follow.
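For reference, the three controllers correspond to the standard LSTM gate equations (textbook notation, not taken from the article). Given input $x_t$ and previous hidden state $h_{t-1}$:

\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{input (write) gate} \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output (read) gate} \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{main-line (cell state) update} \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}

The cell state $C_t$ is the "main line", and the gates decide how much old content is kept and how much new content is written.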
2. Code implementation
The Keras LSTM code for text classification is as follows:
- Keras_LSTM_cnews.py
""" Created on 2021-03-19 @author: xiuzhang Eastmount CSDN LSTM Model """ import os import time import pickle import pandas as pd import numpy as np from sklearn import metrics import matplotlib.pyplot as plt import seaborn as sns import tensorflow as tf from sklearn.preprocessing import LabelEncoder,OneHotEncoder from keras.models import Model from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding from keras.layers import Convolution1D, MaxPool1D, Flatten from keras.preprocessing.text import Tokenizer from keras.preprocessing import sequence from keras.callbacks import EarlyStopping from keras.models import load_model from keras.models import Sequential #GPU accelerates CuDNNLSTM faster than LSTM from keras.layers import CuDNNLSTM, CuDNNGRU ## If the GPU processing reader is a CPU, you can comment on this part of the code ## Specifies the maximum amount of video memory used in each GPU process 0.9 Indicates that it can be used GPU 90%Training with resources os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS" os.environ["CUDA_VISIBLE_DEVICES"] = "0" gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) start = time.clock() #----------------------------Step 1 data reading---------------------------- ## Read test data set train_df = pd.read_csv("news_dataset_train_fc.csv") val_df = pd.read_csv("news_dataset_val_fc.csv") test_df = pd.read_csv("news_dataset_test_fc.csv") print(train_df.head()) ## Solve the problem of Chinese display plt.rcParams['font.sans-serif'] = ['KaiTi'] #Specifies the default font SimHei bold plt.rcParams['axes.unicode_minus'] = False #Solution: the saved image is a negative sign' #--------------------------Step 2 OneHotEncoder()code-------------------- ## Encode the label data of the dataset train_y = train_df.label val_y = val_df.label test_y = test_df.label print("Label:") print(train_y[:10]) le = LabelEncoder() train_y = le.fit_transform(train_y).reshape(-1,1) val_y = le.transform(val_y).reshape(-1,1) test_y = le.transform(test_y).reshape(-1,1) print("LabelEncoder") print(train_y[:10]) print(len(train_y)) ## Label data of the dataset-hot code ohe = OneHotEncoder() train_y = ohe.fit_transform(train_y).toarray() val_y = ohe.transform(val_y).toarray() test_y = ohe.transform(test_y).toarray() print("OneHotEncoder:") print(train_y[:10]) #-----------------------Step 3 use Tokenizer Encode phrases-------------------- max_words = 6000 max_len = 600 tok = Tokenizer(num_words=max_words) #The maximum number of words is 6000 print(train_df.cutword[:5]) print(type(train_df.cutword)) ## Prevent digital str processing in Corpus train_content = [str(a) for a in train_df.cutword.tolist()] val_content = [str(a) for a in val_df.cutword.tolist()] test_content = [str(a) for a in test_df.cutword.tolist()] tok.fit_on_texts(train_content) print(tok) ## Save the trained Tokenizer and import with open('tok.pickle', 'wb') as handle: #saving pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL) with open('tok.pickle', 'rb') as handle: #loading tok = pickle.load(handle) #---------------------------Step 4 convert data into sequence----------------------------- train_seq = tok.texts_to_sequences(train_content) val_seq = tok.texts_to_sequences(val_content) test_seq = tok.texts_to_sequences(test_content) ## Adjust each sequence to the same length train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len) val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len) test_seq_mat = 
sequence.pad_sequences(test_seq,maxlen=max_len) print("Data conversion sequence") print(train_seq_mat.shape) print(val_seq_mat.shape) print(test_seq_mat.shape) print(train_seq_mat[:2]) #-------------------------------Step 5 establish LSTM Model-------------------------- ## Define LSTM model inputs = Input(name='inputs',shape=[max_len],dtype='float64') ## Embedding (vocabulary size, batch size, word length of each news) layer = Embedding(max_words+1, 128, input_length=max_len)(inputs) #layer = LSTM(128)(layer) layer = CuDNNLSTM(128)(layer) layer = Dense(128, activation="relu", name="FC1")(layer) layer = Dropout(0.1)(layer) layer = Dense(4, activation="softmax", name="FC2")(layer) model = Model(inputs=inputs, outputs=layer) model.summary() model.compile(loss="categorical_crossentropy", optimizer='adam', # RMSprop() metrics=["accuracy"]) #-------------------------------Step 6 model training and prediction-------------------------- ## First set to train training and then set to test test flag = "train" if flag == "train": print("model training") ## Model training-loss Stop training when you no longer improve 0.0001 model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10, validation_data=(val_seq_mat,val_y), callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)] ) model.save('my_model.h5') del model elapsed = (time.clock() - start) print("Time used:", elapsed) print(model_fit.history) else: print("model prediction ") ## Import the trained model model = load_model('my_model.h5') ## Predict the test set test_pre = model.predict(test_seq_mat) ## Evaluate the prediction effect and calculate the confusion matrix confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1)) print(confm) ## Confusion matrix visualization Labname = ["Sports", "Culture", "Finance and Economics", "game"] print(metrics.classification_report(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))) plt.figure(figsize=(8,8)) sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False, linewidths=.6, cmap="YlGnBu") plt.xlabel('True label',size = 14) plt.ylabel('Predicted label', size = 14) plt.xticks(np.arange(4)+0.8, Labname, size = 12) plt.yticks(np.arange(4)+0.4, Labname, size = 12) plt.savefig('result.png') plt.show() #----------------------------------Seventh, verify the algorithm-------------------------- ## tok is used to preprocess the validation data set, and the trained model is used for prediction val_seq = tok.texts_to_sequences(val_df.cutword) ## Adjust each sequence to the same length val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len) ## Predict validation set val_pre = model.predict(val_seq_mat) print(metrics.classification_report(np.argmax(val_y,axis=1),np.argmax(val_pre,axis=1))) elapsed = (time.clock() - start) print("Time used:", elapsed)
The training output model is as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
inputs (InputLayer)          (None, 600)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 600, 128)          768128
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 128)               132096
_________________________________________________________________
FC1 (Dense)                  (None, 128)               16512
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
FC2 (Dense)                  (None, 4)                 516
=================================================================
Total params: 917,252
Trainable params: 917,252
Non-trainable params: 0
The predicted results are as follows:
[[4539  153  188  120]
 [  47 4628  181  144]
 [ 113  133 4697   57]
 [ 101  292  157 4450]]
             precision    recall  f1-score   support
          0       0.95      0.91      0.93      5000
          1       0.89      0.93      0.91      5000
          2       0.90      0.94      0.92      5000
          3       0.93      0.89      0.91      5000
avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support
          0       0.96      0.89      0.92      5000
          1       0.89      0.94      0.92      5000
          2       0.90      0.93      0.92      5000
          3       0.94      0.92      0.93      5000
avg / total       0.92      0.92      0.92     20000
VI BiLSTM Chinese text classification
1. Principle introduction
BiLSTM is short for Bi-directional Long Short-Term Memory: the combination of a forward LSTM and a backward LSTM. Like LSTM, it is often used to model contextual information in natural language processing tasks. The forward and backward LSTMs together form the BiLSTM; for example, encoding the sentence "I love China" gives the model shown in the figure.
Modeling a sentence with a single LSTM still has a problem: it cannot encode information from back to front. In finer-grained classification, such as the five-class task of strong positive, weak positive, neutral, weak negative and strong negative, we need to capture the interaction between sentiment words, degree words and negation words. For example, in "this restaurant is nowhere near as dirty as the one next door", the negation modifies the degree of "dirty". BiLSTM captures such bidirectional semantic dependencies better; a minimal Keras sketch follows the reference below.
- Reference article: https://zhuanlan.zhihu.com/p/47802053
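In Keras the bidirectional structure only requires wrapping a recurrent layer with Bidirectional. A minimal sketch (using the plain LSTM layer so it also runs on CPU; the full code below uses CuDNNLSTM) showing that the two 128-unit directions are concatenated into a 256-dimensional sentence vector, matching the model summary later in this section:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense

model = Sequential()
model.add(Embedding(6001, 128, input_length=600))
model.add(Bidirectional(LSTM(128)))   # forward and backward states concatenated -> (None, 256)
model.add(Dense(4, activation='softmax'))
model.summary()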
2. Code implementation
The Keras BiLSTM code for text classification is as follows:
- Keras_BiLSTM_cnews.py
""" Created on 2021-03-19 @author: xiuzhang Eastmount CSDN BiLSTM Model """ import os import time import pickle import pandas as pd import numpy as np from sklearn import metrics import matplotlib.pyplot as plt import seaborn as sns import tensorflow as tf from sklearn.preprocessing import LabelEncoder,OneHotEncoder from keras.models import Model from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding from keras.layers import Convolution1D, MaxPool1D, Flatten from keras.preprocessing.text import Tokenizer from keras.preprocessing import sequence from keras.callbacks import EarlyStopping from keras.models import load_model from keras.models import Sequential #GPU accelerates CuDNNLSTM faster than LSTM from keras.layers import CuDNNLSTM, CuDNNGRU from keras.layers import Bidirectional ## If the GPU processing reader is a CPU, you can comment on this part of the code ## Specifies the maximum amount of video memory used in each GPU process 0.9 Indicates that it can be used GPU 90%Training with resources os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS" os.environ["CUDA_VISIBLE_DEVICES"] = "0" gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) start = time.clock() #----------------------------Step 1 data reading---------------------------- ## Read test data set train_df = pd.read_csv("news_dataset_train_fc.csv") val_df = pd.read_csv("news_dataset_val_fc.csv") test_df = pd.read_csv("news_dataset_test_fc.csv") print(train_df.head()) ## Solve the problem of Chinese display plt.rcParams['font.sans-serif'] = ['KaiTi'] #Specifies the default font SimHei bold plt.rcParams['axes.unicode_minus'] = False #Solution: the saved image is a negative sign' #--------------------------Step 2 OneHotEncoder()code-------------------- ## Encode the label data of the dataset train_y = train_df.label val_y = val_df.label test_y = test_df.label print("Label:") print(train_y[:10]) le = LabelEncoder() train_y = le.fit_transform(train_y).reshape(-1,1) val_y = le.transform(val_y).reshape(-1,1) test_y = le.transform(test_y).reshape(-1,1) print("LabelEncoder") print(train_y[:10]) print(len(train_y)) ## Label data of the dataset-hot code ohe = OneHotEncoder() train_y = ohe.fit_transform(train_y).toarray() val_y = ohe.transform(val_y).toarray() test_y = ohe.transform(test_y).toarray() print("OneHotEncoder:") print(train_y[:10]) #-----------------------Step 3 use Tokenizer Encode phrases-------------------- max_words = 6000 max_len = 600 tok = Tokenizer(num_words=max_words) #The maximum number of words is 6000 print(train_df.cutword[:5]) print(type(train_df.cutword)) ## Prevent digital str processing in Corpus train_content = [str(a) for a in train_df.cutword.tolist()] val_content = [str(a) for a in val_df.cutword.tolist()] test_content = [str(a) for a in test_df.cutword.tolist()] tok.fit_on_texts(train_content) print(tok) ## Save the trained Tokenizer and import with open('tok.pickle', 'wb') as handle: #saving pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL) with open('tok.pickle', 'rb') as handle: #loading tok = pickle.load(handle) #---------------------------Step 4 convert data into sequence----------------------------- train_seq = tok.texts_to_sequences(train_content) val_seq = tok.texts_to_sequences(val_content) test_seq = tok.texts_to_sequences(test_content) ## Adjust each sequence to the same length train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len) val_seq_mat = 
sequence.pad_sequences(val_seq,maxlen=max_len) test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len) print("Data conversion sequence") print(train_seq_mat.shape) print(val_seq_mat.shape) print(test_seq_mat.shape) print(train_seq_mat[:2]) #-------------------------------Step 5 establish BiLSTM Model-------------------------- num_labels = 4 model = Sequential() model.add(Embedding(max_words+1, 128, input_length=max_len)) model.add(Bidirectional(CuDNNLSTM(128))) model.add(Dense(128, activation='relu')) model.add(Dropout(0.3)) model.add(Dense(num_labels, activation='softmax')) model.summary() model.compile(loss="categorical_crossentropy", optimizer='adam', # RMSprop() metrics=["accuracy"]) #-------------------------------Step 6 model training and prediction-------------------------- ## First set to train training and then set to test test flag = "train" if flag == "train": print("model training") ## Model training-loss Stop training when you no longer improve 0.0001 model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10, validation_data=(val_seq_mat,val_y), callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)] ) model.save('my_model.h5') del model elapsed = (time.clock() - start) print("Time used:", elapsed) print(model_fit.history) else: print("model prediction ") ## Import the trained model model = load_model('my_model.h5') ## Predict the test set test_pre = model.predict(test_seq_mat) ## Evaluate the prediction effect and calculate the confusion matrix confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1)) print(confm) ## Confusion matrix visualization Labname = ["Sports", "Culture", "Finance and Economics", "game"] print(metrics.classification_report(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))) plt.figure(figsize=(8,8)) sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False, linewidths=.6, cmap="YlGnBu") plt.xlabel('True label',size = 14) plt.ylabel('Predicted label', size = 14) plt.xticks(np.arange(4)+0.5, Labname, size = 12) plt.yticks(np.arange(4)+0.5, Labname, size = 12) plt.savefig('result.png') plt.show() #----------------------------------Seventh, verify the algorithm-------------------------- ## tok is used to preprocess the validation data set, and the trained model is used for prediction val_seq = tok.texts_to_sequences(val_df.cutword) ## Adjust each sequence to the same length val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len) ## Predict validation set val_pre = model.predict(val_seq_mat) print(metrics.classification_report(np.argmax(val_y,axis=1),np.argmax(val_pre,axis=1))) elapsed = (time.clock() - start) print("Time used:", elapsed)
The model structure and training output are shown below; training on the GPU is still very fast.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 600, 128)          768128
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               264192
_________________________________________________________________
dense_1 (Dense)              (None, 128)               32896
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516
=================================================================
Total params: 1,065,732
Trainable params: 1,065,732
Non-trainable params: 0

Train on 40000 samples, validate on 20000 samples
Epoch 1/10
40000/40000 [==============================] - 23s 587us/step - loss: 0.5825 - acc: 0.8038 - val_loss: 0.2321 - val_acc: 0.9246
Epoch 2/10
40000/40000 [==============================] - 21s 521us/step - loss: 0.1433 - acc: 0.9542 - val_loss: 0.2422 - val_acc: 0.9228
Time used: 52.763230400000005
The prediction results are shown in the figure below:
[[4593  143  113  151]
 [  81 4679   60  180]
 [ 110  199 4590  101]
 [  73  254   82 4591]]
             precision    recall  f1-score   support
          0       0.95      0.92      0.93      5000
          1       0.89      0.94      0.91      5000
          2       0.95      0.92      0.93      5000
          3       0.91      0.92      0.92      5000
avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support
          0       0.94      0.90      0.92      5000
          1       0.89      0.95      0.92      5000
          2       0.95      0.90      0.93      5000
          3       0.91      0.94      0.93      5000
avg / total       0.92      0.92      0.92     20000
VII BiLSTM+Attention Chinese text classification
1. Principle introduction
The attention mechanism is a problem-solving approach modeled on human attention: in short, it quickly filters high-value information out of a large amount of information. It is mainly used to address the difficulty of obtaining a reasonable final vector representation when the input sequence of an LSTM/RNN model is long. The method keeps the intermediate outputs of the LSTM, learns weights over them with a new layer, and ties them to the output, thereby achieving information filtering.
What is attention?
First, a brief description of the attention mechanism; NLP practitioners will hardly be unfamiliar with it. It came to prominence in the paper "Attention Is All You Need", where it helped the deep model greatly improve machine translation performance and set the state of the art at the time. Of course, besides attention the model uses many other useful tricks to improve performance, but undeniably the core of the model is attention.
The attention mechanism is a technique that lets a model focus on important information and learn from it fully. It is not a complete model in itself but a technique that can be applied to any sequence model.
Why attention?
Why introduce attention? Take the seq2seq model as an example: for a text sequence, we usually use some mechanism to encode the sequence into a fixed-length vector through dimensionality reduction and similar methods, then feed it to a subsequent fully connected layer. Typically we encode the sequence with CNN or RNN models (including GRU or LSTM), then apply some form of pooling, or simply take the RNN's hidden state at the last time step t as the vector output for the sentence.
But a problem arises here: conventional encoding cannot reflect differing attention to the different tokens of a sentence. In natural language, different parts of a sentence carry different meaning and importance. In the example "I hate this movie", a sentiment analyzer should obviously pay more attention to the word "hate". CNNs and RNNs can encode this information to a degree, but their encoding capacity has an upper bound, and for longer texts the model gains little.
- Reference and recommended articles: https://zhuanlan.zhihu.com/p/46313756
Attention has a wide range of applications, covering text, images and speech.
- Text: applied in seq2seq models, most commonly in translation
- Images: applied to feature extraction in convolutional neural networks
- Speech
The figure below shows the classic BiLSTM+Attention model, which is also the model we build next; the scoring function used by its attention layer is summarized after this paragraph.
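Concretely, the attention layer implemented in the code below scores each time step $h_t$ of the BiLSTM output $H = (h_1, \dots, h_T)$ and returns their weighted sum as the sentence vector $s$:

\begin{aligned}
u_t &= \tanh\!\left(W h_t + b\right) \\
\alpha_t &= \frac{\exp\!\left(v^\top u_t\right)}{\sum_{k=1}^{T} \exp\!\left(v^\top u_k\right)} \\
s &= \sum_{t=1}^{T} \alpha_t h_t
\end{aligned}

Here $W$, $b$ and $v$ are the trainable weights att_weight, att_bias and att_var of the custom AttentionLayer defined in the code, and $s$ is fed to the softmax classifier.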
2. Code implementation
The Keras BiLSTM+Attention code for text classification is as follows:
- Keras_Attention_BiLSTM_cnews.py
""" Created on 2021-03-19 @author: xiuzhang Eastmount CSDN BiLSTM+Attention Model """ import os import time import pickle import pandas as pd import numpy as np from sklearn import metrics import matplotlib.pyplot as plt import seaborn as sns import tensorflow as tf from sklearn.preprocessing import LabelEncoder,OneHotEncoder from keras.models import Model from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding from keras.layers import Convolution1D, MaxPool1D, Flatten from keras.preprocessing.text import Tokenizer from keras.preprocessing import sequence from keras.callbacks import EarlyStopping from keras.models import load_model from keras.models import Sequential #GPU accelerates CuDNNLSTM faster than LSTM from keras.layers import CuDNNLSTM, CuDNNGRU from keras.layers import Bidirectional ## If the GPU processing reader is a CPU, you can comment on this part of the code ## Specifies the maximum amount of video memory used in each GPU process 0.9 Indicates that it can be used GPU 90%Training with resources os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS" os.environ["CUDA_VISIBLE_DEVICES"] = "0" gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) start = time.clock() #----------------------------Step 1 data reading---------------------------- ## Read test data set train_df = pd.read_csv("news_dataset_train_fc.csv") val_df = pd.read_csv("news_dataset_val_fc.csv") test_df = pd.read_csv("news_dataset_test_fc.csv") print(train_df.head()) ## Solve the problem of Chinese display plt.rcParams['font.sans-serif'] = ['KaiTi'] #Specifies the default font SimHei bold plt.rcParams['axes.unicode_minus'] = False #Solution: the saved image is a negative sign' #--------------------------Step 2 OneHotEncoder()code-------------------- ## Encode the label data of the dataset train_y = train_df.label val_y = val_df.label test_y = test_df.label print("Label:") print(train_y[:10]) le = LabelEncoder() train_y = le.fit_transform(train_y).reshape(-1,1) val_y = le.transform(val_y).reshape(-1,1) test_y = le.transform(test_y).reshape(-1,1) print("LabelEncoder") print(train_y[:10]) print(len(train_y)) ## Label data of the dataset-hot code ohe = OneHotEncoder() train_y = ohe.fit_transform(train_y).toarray() val_y = ohe.transform(val_y).toarray() test_y = ohe.transform(test_y).toarray() print("OneHotEncoder:") print(train_y[:10]) #-----------------------Step 3 use Tokenizer Encode phrases-------------------- max_words = 6000 max_len = 600 tok = Tokenizer(num_words=max_words) #The maximum number of words is 6000 print(train_df.cutword[:5]) print(type(train_df.cutword)) ## Prevent digital str processing in Corpus train_content = [str(a) for a in train_df.cutword.tolist()] val_content = [str(a) for a in val_df.cutword.tolist()] test_content = [str(a) for a in test_df.cutword.tolist()] tok.fit_on_texts(train_content) print(tok) ## Save the trained Tokenizer and import with open('tok.pickle', 'wb') as handle: #saving pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL) with open('tok.pickle', 'rb') as handle: #loading tok = pickle.load(handle) #---------------------------Step 4 convert data into sequence----------------------------- train_seq = tok.texts_to_sequences(train_content) val_seq = tok.texts_to_sequences(val_content) test_seq = tok.texts_to_sequences(test_content) ## Adjust each sequence to the same length train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len) val_seq_mat = 
sequence.pad_sequences(val_seq,maxlen=max_len) test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len) print("Data conversion sequence") print(train_seq_mat.shape) print(val_seq_mat.shape) print(test_seq_mat.shape) print(train_seq_mat[:2]) #---------------------------Step 5 establish Attention mechanism---------------------- """ because Keras There is no ready-made Attention Layer can be used directly. We need to build a new layer function ourselves. Keras The user-defined function is mainly divided into four parts: init: Initialize some required parameters bulid: How to define the weight call: The core part defines how vectors operate compute_output_shape: Defines the size of the layer's output Recommended articles: https://blog.csdn.net/huanghaocs/article/details/95752379 https://zhuanlan.zhihu.com/p/29201491 """ # Hierarchical Model with Attention from keras import initializers from keras import constraints from keras import activations from keras import regularizers from keras import backend as K from keras.engine.topology import Layer K.clear_session() class AttentionLayer(Layer): def __init__(self, attention_size=None, **kwargs): self.attention_size = attention_size super(AttentionLayer, self).__init__(**kwargs) def get_config(self): config = super().get_config() config['attention_size'] = self.attention_size return config def build(self, input_shape): assert len(input_shape) == 3 self.time_steps = input_shape[1] hidden_size = input_shape[2] if self.attention_size is None: self.attention_size = hidden_size self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size), initializer='uniform', trainable=True) self.b = self.add_weight(name='att_bias', shape=(self.attention_size,), initializer='uniform', trainable=True) self.V = self.add_weight(name='att_var', shape=(self.attention_size,), initializer='uniform', trainable=True) super(AttentionLayer, self).build(input_shape) def call(self, inputs): self.V = K.reshape(self.V, (-1, 1)) H = K.tanh(K.dot(inputs, self.W) + self.b) score = K.softmax(K.dot(H, self.V), axis=1) outputs = K.sum(score * inputs, axis=1) return outputs def compute_output_shape(self, input_shape): return input_shape[0], input_shape[2] #-------------------------------Step 6 establish BiLSTM Model-------------------------- ## Defining the BiLSTM model ## BiLSTM+Attention num_labels = 4 inputs = Input(name='inputs',shape=[max_len],dtype='float64') layer = Embedding(max_words+1, 256, input_length=max_len)(inputs) #lstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(layer) bilstm = Bidirectional(CuDNNLSTM(128, return_sequences=True))(layer) #Parameter hold dimension 3 layer = Dense(128, activation='relu')(bilstm) layer = Dropout(0.2)(layer) ## Attention mechanism attention = AttentionLayer(attention_size=50)(layer) output = Dense(num_labels, activation='softmax')(attention) model = Model(inputs=inputs, outputs=output) model.summary() model.compile(loss="categorical_crossentropy", optimizer='adam', # RMSprop() metrics=["accuracy"]) #-------------------------------Step 7 model training and prediction-------------------------- ## First set to train training and then set to test test flag = "test" if flag == "train": print("model training") ## Model training-loss Stop training when you no longer improve 0.0001 model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=10, validation_data=(val_seq_mat,val_y), callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)] ) ## Save model 
model.save('my_model.h5') del model # deletes the existing model ## computing time elapsed = (time.clock() - start) print("Time used:", elapsed) print(model_fit.history) else: print("model prediction ") ## Import the trained model model = load_model('my_model.h5', custom_objects={'AttentionLayer': AttentionLayer(50)}, compile=False) ## Predict the test set test_pre = model.predict(test_seq_mat) ## Evaluate the prediction effect and calculate the confusion matrix confm = metrics.confusion_matrix(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1)) print(confm) ## Confusion matrix visualization Labname = ["Sports", "Culture", "Finance and Economics", "game"] print(metrics.classification_report(np.argmax(test_y,axis=1),np.argmax(test_pre,axis=1))) plt.figure(figsize=(8,8)) sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False, linewidths=.6, cmap="YlGnBu") plt.xlabel('True label',size = 14) plt.ylabel('Predicted label', size = 14) plt.xticks(np.arange(4)+0.5, Labname, size = 12) plt.yticks(np.arange(4)+0.5, Labname, size = 12) plt.savefig('result.png') plt.show() #----------------------------------Seventh, verify the algorithm-------------------------- ## tok is used to preprocess the validation data set, and the trained model is used for prediction val_seq = tok.texts_to_sequences(val_df.cutword) ## Adjust each sequence to the same length val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len) ## Predict validation set val_pre = model.predict(val_seq_mat) print(metrics.classification_report(np.argmax(val_y,axis=1),np.argmax(val_pre,axis=1))) ## computing time elapsed = (time.clock() - start) print("Time used:", elapsed)
The training output model is as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
inputs (InputLayer)          (None, 600)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 600, 256)          1536256
_________________________________________________________________
bidirectional_1 (Bidirection (None, 600, 256)          395264
_________________________________________________________________
dense_1 (Dense)              (None, 600, 128)          32896
_________________________________________________________________
dropout_1 (Dropout)          (None, 600, 128)          0
_________________________________________________________________
attention_layer_1 (Attention (None, 128)               6500
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 516
=================================================================
Total params: 1,971,432
Trainable params: 1,971,432
Non-trainable params: 0
The prediction results are shown in the figure below:
[[4625  138  100  137]
 [  63 4692   77  168]
 [ 129  190 4589   92]
 [  82  299   78 4541]]
             precision    recall  f1-score   support
          0       0.94      0.93      0.93      5000
          1       0.88      0.94      0.91      5000
          2       0.95      0.92      0.93      5000
          3       0.92      0.91      0.91      5000
avg / total       0.92      0.92      0.92     20000

             precision    recall  f1-score   support
          0       0.95      0.91      0.93      5000
          1       0.88      0.95      0.91      5000
          2       0.95      0.90      0.92      5000
          3       0.92      0.93      0.93      5000
avg / total       0.92      0.92      0.92     20000