torchtext+textcnn emotion classification

Posted by nsantos on Sun, 30 Jan 2022 03:55:10 +0100

The data used in this example is an English dataset with three classes. torchtext is used to process the data and build an iterator, a TextCNN model is built, and the model is trained on the data to obtain the training results. A validation set is not used to evaluate the model in this example.

1, Development environment and data set

1. Development environment

Ubuntu 16.04.6

python: 3.7

pytorch: 1.8.1

torchtext: 0.9.1

2. Data set

Dataset: train_data_sentiment
Extraction code: gw77

2, Working with datasets using torchtext

1. Import necessary Libraries

#Import common library
import torch
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F
import torchtext
#Newer versions of torchtext require torchtext.legacy.data; older versions use torchtext.data
from torchtext.legacy.data import TabularDataset 
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu") #I used a GPU that happened to be free on our lab server; change the device index to whatever is available on your machine

2. Import and view datasets

#Read the data here just to take a look; the torchtext Dataset built later reads the same CSV file directly
train_data = pd.read_csv('train_data_sentiment.csv')
train_data
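If the file loaded correctly you should see 9,989 rows. A quick check (the column names utterance and label are assumptions taken from the field definitions below):

print(train_data.shape)    #expected: (9989, 2)
print(train_data.columns)  #expected: the text column 'utterance' and the label column 'label'
print(train_data.head())   #first few rows of text and their numeric labels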

3. Working with datasets using torchtext

Torchtext's data processing mainly consists of three parts: defining the Field, the Dataset, and the iterator. It makes it easy to process text data, e.g. tokenization, truncation and padding, building a vocabulary, and so on. If you are not familiar with torchtext, you can read the official documentation or an introductory blog post.

3.1. Define Field

TEXT = torchtext.legacy.data.Field(sequential=True,lower=True,fix_length=30)   #The default tokenizer is split(), i.e. splitting on whitespace
LABEL = torchtext.legacy.data.Field(sequential=False,use_vocab=False)
  • sequential: whether the data represents a sequence. If False, no tokenization is applied. Default: True.
  • lower: whether to convert the data to lowercase. Default: False.
  • fix_length: when the iterator is built, every text is truncated or padded to this length; padding uses the pad token. Default: None.
  • use_vocab: whether to use a Vocab object. If False, the data must already be numeric. Default: True.
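To see what these settings do (before padding, which only happens when batches are built), you can run the Field's preprocessing step on a raw string. A minimal sketch with a made-up sentence:

#Field.preprocess applies the tokenizer (the default split()) and lowercasing, but not fix_length padding
print(TEXT.preprocess("It is YOU"))   #expected: ['it', 'is', 'you']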

3.2. Define Dataset

TabularDataset can easily read data files in CSV, JSON or TSV format.

train_x = TabularDataset(path = 'train_data_sentiment.csv',
                        format = 'csv',skip_header=True,
                        fields = [('utterance',TEXT),('label',LABEL)])
  • skip_header=True means the header row (the column names) is not treated as data.
  • The order of the fields must match the order of the columns in the original data.

You can see that the text data has been tokenized.
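For example, you can print the first example to confirm this (these two checks also appear, commented out, in the complete code at the end of the post):

print(train_x[0].utterance)   #the first sentence as a list of lowercase tokens
print(train_x[0].label)       #the corresponding label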

3.3. Build the vocabulary and load pre-trained word vectors

Because a computer cannot read text directly, we need to convert the text data into numbers or vectors before we can feed it into TextCNN or any other deep neural network. First, the text data is turned into word indices by building a vocabulary. Then, after loading the pre-trained word vectors, each word corresponds to a word vector. Finally, during model training we use the word embedding matrix, i.e. the Embedding layer, so that each word is converted into a vector that can be fed into the model.

Here I want to explain the word embedding matrix, because it took me a long time to understand it when I was learning. Once we have built the vocabulary, we can represent a sentence by indices: according to the vocabulary built below, a sentence such as "it is you" becomes a sequence of word indices (for example, "you" maps to index 3). We could feed those indices directly into the network for training, but an index by itself carries very few features. To give the network better features to learn from, we usually use word2vec or GloVe vectors; GloVe vectors are used in this post, so each word in "it is you" is represented by a 300-dimensional vector, which gives the network far more information and trains the model better. To summarize the word embedding matrix: ① build the vocabulary, i.e. word → index; ② load the pre-trained word vectors, i.e. word → vector; ③ combine the two to get the word embedding matrix, i.e. index → vector.

#Build the vocabulary
TEXT.build_vocab(train_x) #10,440 words are constructed, indexed 0-10439
for w,i in TEXT.vocab.stoi.items():
    print(w,i)

#Load the GloVe word vectors; they are downloaded automatically the first time (you can also download them yourself). The 6B.300d vectors cover 400,000 words, each represented by a 300-dimensional vector
TEXT.vocab.load_vectors('glove.6B.300d',unk_init=torch.Tensor.normal_) #Words that appear in the data but not in the GloVe vocabulary are randomly initialized with a 300-dimensional vector

We can check the vectors in the word embedding matrix we just built. For example, index 3 in our vocabulary corresponds to the word "you", which is represented by a 300-dimensional vector.

We can also look at the vector for the word "you" in the GloVe vectors themselves: it is the same vector, which shows that we can obtain a word's vector through its index, and that is exactly what the word embedding matrix means.
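This can be checked in code. A small sketch, assuming the GloVe file has also been converted to word2vec format and loaded with gensim as glove_model, as in the complete code at the end of the post:

idx = TEXT.vocab.stoi['you']          #index of 'you' in our vocabulary, 3 in this example
print(TEXT.vocab.vectors[idx][:10])   #first 10 values of its 300-dimensional row in the embedding matrix
print(glove_model['you'][:10])        #first 10 values of the same word's GloVe vector; the two should match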

#View word vector dimension
print(TEXT.vocab.vectors.shape) #torch.Size([10440, 300])

You can see that our vocabulary contains 10,440 words and each word is represented by a 300-dimensional vector. Next we can build an iterator.

3.4. Building iterators

torchtext provides two main iterators, Iterator and BucketIterator:

  • Iterator: builds batches in the same order as the original data.
  • BucketIterator: groups examples of similar length into the same batch, which reduces the amount of truncation and padding needed (a sketch of this variant follows this list).
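This post uses the plain Iterator below. If you want to try the BucketIterator variant instead, a rough sketch would look like this (sort_key tells it which field's length to bucket on):

#Possible BucketIterator setup (not used in this post): groups examples of similar length into the same batch
train_iter_bucket = torchtext.legacy.data.BucketIterator(dataset=train_x, batch_size=64,
                        sort_key=lambda x: len(x.utterance),
                        sort_within_batch=True, shuffle=True, device=device)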

Generally, when training the network we feed in one batch at a time. I set batch_size=64, which gives 9989 // 64 + 1 = 157 batches: we have 9,989 examples in total, 9989 // 64 = 156 full batches of 64, and the remaining 5 examples form one more batch.
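As a quick sanity check of that arithmetic (assuming the dataset really has 9,989 examples):

import math
print(math.ceil(9989 / 64))   #157: 156 full batches of 64 plus one final batch of 5 examples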

batch_size = 64
train_iter = torchtext.legacy.data.Iterator(dataset = train_x,batch_size=64,shuffle=True,sort_within_batch=False,repeat=False,device=device)
len(train_iter) #157
  • shuffle: whether to shuffle the data.
  • sort_within_batch: whether to sort the data within each batch.
  • repeat: whether to repeat batches across different epochs.

View the built iterator and the internal data representation:

#View built iterators
list(train_iter)

#View the size of batch data
for batch in train_iter:
    print(batch.utterance.shape)

You can see that every batch contains 64 examples (except the last one), i.e. batch_size=64, and each example consists of 30 word indices. You can also see that the last 5 examples form a batch of their own.

#View the first data
batch.utterance[:,0]   #Take column 0, which is the first example in the batch (column 63 would be the 64th example). Each example consists of 30 word indices: the entries that are not 1 are the indices of the example's words in the vocabulary, and the 1s are the padding token

#View the index value corresponding to the word in the first data
list_a=[]
for i in batch.utterance[:,0]:
    if i.item()!=1:
        list_a.append(i.item())
print(list_a)
for i in list_a:
    print(TEXT.vocab.itos[i],end=' ')

#View the data in the iterator and its corresponding text
l =[]
for batch in list(train_iter)[:1]:
    for i in batch.utterance:
        l.append(i[0].item())
    print(l)
    print(' '.join([TEXT.vocab.itos[i] for  i in l]))

So far, the data has been processed. Next, let's learn about textcnn.

3, textcnn knowledge and pytorch framework construction

1. textcnn knowledge

TextCNN is similar to the CNNs familiar from image processing, but while the convolution kernels in an image CNN are usually of size k * k, the kernels in NLP are usually of size k * embedding_size, where embedding_size is the dimension of the word vector used for each word. For example, the sentence in the figure below consists of three words, each represented by a three-dimensional word vector. We choose two convolution kernels of size 2 * 3 and obtain the results shown; in practice the pooled results are concatenated, although they are not concatenated in the figure for demonstration purposes.

As can be seen from the figure, the result of convolving with one kernel has shape [len(sentence) - k + 1, 1], where len(sentence) is the number of words in the sentence. After max pooling, each kernel yields a single value, and with two kernels we get a [2, 1] result, which is the concatenation of the two pooled values.
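To make the shape arithmetic concrete, here is a tiny sketch with made-up sizes (a sentence of 3 words, 3-dimensional word vectors, one 2 * 3 kernel), using Conv1d the same way the model below does:

x = torch.randn(1, 3, 3)                         #[batch=1, embedding_size=3, len(sentence)=3]
conv = nn.Conv1d(in_channels=3, out_channels=1, kernel_size=2)
out = conv(x)
print(out.shape)                                 #torch.Size([1, 1, 2]): len(sentence)-k+1 = 2 positions
print(nn.MaxPool1d(kernel_size=2)(out).shape)    #torch.Size([1, 1, 1]): one value per kernel after max pooling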

With the basic theory covered, we can build the TextCNN network. If you want to learn more about TextCNN, there is plenty of further material available.

2. Using pytorch to build textcnn

I built a TextCNN with two convolutional layers (two parallel branches with different kernel sizes). The TextCNN framework is mainly: convolution, activation and pooling.

Parameter description in network framework:

  • vocab_size: number of words in the constructed Thesaurus
  • embedding_size: word vector dimension of each word
  • num_channels: the number of output channels, that is, the number of convolution kernels
  • kernel_sizes: convolution kernel size
kernel_sizes, nums_channels =  [3, 4], [150, 150]
embedding_size = 300
num_class = 3
vocab_size = 10440

Here I build a two-layer TextCNN. The convolution kernels in the first layer have size 3 * 300, and there are 150 kernels of that shape; the kernels in the second layer have size 4 * 300, and likewise there are 150 of them. The TextCNN code is as follows:

class TextCNN(nn.Module):
    def __init__(self,kernel_sizes,num_channels):
        super(TextCNN,self).__init__()
        self.embedding = nn.Embedding(vocab_size,embedding_size) #embedding layer
        self.dropout = nn.Dropout(0.5) 
        self.convs = nn.ModuleList()
        for c,k in zip(num_channels,kernel_sizes):
            self.convs.append(nn.Sequential(nn.Conv1d(in_channels=embedding_size, 
                                       out_channels = c,            #Number of output channels, [150, 150]: 150 kernels of size 3*embedding_size and 150 of size 4*embedding_size
                                       kernel_size = k),            #The convolution kernel size, here 3 or 4
                             nn.ReLU(),                             #Activation function
                             nn.MaxPool1d(30-k+1)))                 #Max pooling: take the maximum of the 30-k+1 values produced by the convolution (each example consists of 30 words)
        self.decoder = nn.Linear(sum(num_channels),3)               #Fully connected layer: the input is a 300-dimensional vector, the output is 3-dimensional, i.e. the number of classes
        
    def forward(self,inputs):
        embed = self.embedding(inputs)   #[30,64,300]
        embed = embed.permute(1,2,0)     #[64,300,30]; the dimensions are swapped to match the input layout expected by the convolution
        #In the next step, each of the two branches produces a [64,150,1] result, which becomes [64,150] after squeeze; the two results are then concatenated to get [64,300]
        encoding = torch.cat([conv(embed).squeeze(-1) for conv in self.convs],dim=1) #[64,300] 
        outputs =self.decoder(self.dropout(encoding))   #Feed the [64,300] tensor into the fully connected layer to get the final [64,3] result
        return outputs 

Let's take a look at the built network framework

net = TextCNN(kernel_sizes, nums_channels).to(device)
net


As can be seen from the printed model, the network has two branches. The only difference between them is the convolution kernel size and the pooling kernel size; everything else is the same.

  • Input channels: in_channels=300;
  • Output channels: out_channels=150;
  • Convolution kernel size: 3 * 300 in the first branch and 4 * 300 in the second;
  • The convolution stride is stride=1;
  • The pooling kernel size is 30 - kernel_size + 1: each example consists of 30 words, so the convolution output has shape [30 - kernel_size + 1, 1], and max pooling selects the single largest value from it. You can verify this yourself; the quick check below also confirms the final output shape.
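To confirm the shapes end to end, you can push a dummy batch of word indices through the network (a quick sketch; the values are random, only the shapes matter here):

dummy = torch.randint(0, vocab_size, (30, 64)).to(device)   #[fix_length=30, batch_size=64]
print(net(dummy).shape)                                      #expected: torch.Size([64, 3])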

4, Model training and results

1. Define parameters such as training function, optimizer and loss function

I usually define a training function and call it to train the model; feel free to organize this however you like.

net.embedding.weight.data.copy_(TEXT.vocab.vectors)    #Copy our word embedding matrix into the model's Embedding layer
optimizer = optim.Adam(net.parameters(),lr=1e-4)       #Define the optimizer; lr is the learning rate and can be tuned
criterion = nn.CrossEntropyLoss().to(device)           #Define the loss function
train_x_len = len(train_x)                             #Number of training examples (9,989), used below to average the loss and accuracy
#Define training function
def train(net,iterator,optimizer,criterion,train_x_len):
    epoch_loss = 0                           #Initialize loss value
    epoch_acc = 0                            #Initialize acc value
    for batch in iterator:
        optimizer.zero_grad()                #Gradient clearing
        preds = net(batch.utterance)         #Forward propagation to obtain the predicted value
        loss = criterion(preds,batch.label)  #Calculate loss
        epoch_loss +=loss.item()             #Accumulate loss as the numerator of the average loss below
        loss.backward()                      #Back propagation
        optimizer.step()                     #Update weight parameters in the network
        epoch_acc+=((preds.argmax(axis=1))==batch.label).sum().item()   #Accumulate the number of correct predictions, the numerator of the average accuracy below
    return epoch_loss/(train_x_len),epoch_acc/train_x_len    #Return the average loss and accuracy

2. Conduct training

Training runs for 100 epochs in total, printing the results every 10 epochs.

n_epoch = 100
acc_plot=[]     #For drawing later
loss_plot=[]    #For drawing later
for epoch in range(n_epoch):
    train_loss,train_acc = train(net,train_iter,optimizer,criterion,train_x_len)
    acc_plot.append(train_acc)
    loss_plot.append(train_loss)
    if (epoch+1)%10==0:
        print('epoch: %d \t loss: %.4f \t train_acc: %.4f'%(epoch+1,train_loss,train_acc))

The results are as follows:

3. Visualization results

#Plot the training accuracy and loss with matplotlib
plt.figure(figsize =(10,5),dpi=80)
plt.plot(acc_plot,label='train_acc')
plt.plot(loss_plot,color='coral',label='train_loss')
plt.legend(loc = 0)
plt.grid(True,linestyle = '--',alpha=1)
plt.xlabel('epoch',fontsize = 15)
plt.show()

5, Summary

This post mainly uses torchtext to process the training dataset and build it into an iterator for model input. A validation set is not used to evaluate the model; you can try splitting a validation set off from the training data to evaluate the model (a rough sketch of such an evaluation function is given after the complete code). The complete code is as follows:

#I have tested this code and it runs without problems; some of the print/debug code is commented out
#Import common library
import torch
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F
import torchtext
#Newer versions of torchtext require torchtext.legacy.data; older versions use torchtext.data
from torchtext.legacy.data import TabularDataset 
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu") #Change the device index to whatever GPU is available on your machine

train_data = pd.read_csv('train_data_sentiment.csv')
# train_data

#Processing data using torchtext
#Define Field
TEXT = torchtext.legacy.data.Field(sequential=True,lower=True,fix_length=30)
LABEL = torchtext.legacy.data.Field(sequential=False,use_vocab=False)

train_x = TabularDataset(path = 'train_data_sentiment.csv',
                        format = 'csv',skip_header=True,
                        fields = [('utterance',TEXT),('label',LABEL)])

# print(train_x[0].utterance)
# print(train_x[0].label)

TEXT.build_vocab(train_x)
# for w,i in TEXT.vocab.stoi.items():
#     print(w,i)

TEXT.vocab.load_vectors('glove.6B.300d',unk_init=torch.Tensor.normal_)

glove_model = KeyedVectors.load_word2vec_format('glove.6B.300d.word2vec.txt', binary=False)
# glove_model['you']

# print(TEXT.vocab.vectors.shape) #torch.Size([10440, 300])

batch_size = 64
train_iter = torchtext.legacy.data.Iterator(dataset = train_x,batch_size=64,shuffle=True,sort_within_batch=False,repeat=False,device=device)

# len(train_iter)
# list(train_iter)

# for batch in train_iter:
#     print(batch.utterance.shape)

# batch.utterance[:,0]

# list_a=[]
# for i in batch.utterance[:,0]:
#     if i.item()!=1:
#         list_a.append(i.item())
# print(list_a)
# for i in list_a:
#     print(TEXT.vocab.itos[i],end=' ')

# l =[]
# for batch in list(train_iter)[:1]:
#     for i in batch.utterance:
#         l.append(i[0].item())
#     print(l)
#     print(' '.join([TEXT.vocab.itos[i] for  i in l]))

kernel_sizes, nums_channels =  [3, 4], [150, 150]
embedding_size = 300
num_class = 3
vocab_size = 10440

#Build textcnn
class TextCNN(nn.Module):
    def __init__(self,kernel_sizes,num_channels):
        super(TextCNN,self).__init__()
        self.embedding = nn.Embedding(vocab_size,embedding_size)
        self.dropout = nn.Dropout(0.5) 
        self.convs = nn.ModuleList()
        for c,k in zip(num_channels,kernel_sizes):
            self.convs.append(nn.Sequential(nn.Conv1d(in_channels=embedding_size, 
                                       out_channels = c,           
                                       kernel_size = k),           
                             nn.ReLU(),                           
                             nn.MaxPool1d(30-k+1)))                
        self.decoder = nn.Linear(sum(num_channels),3)              
    def forward(self,inputs):
        embed = self.embedding(inputs)
        embed = embed.permute(1,2,0)
        encoding = torch.cat([conv(embed).squeeze(-1) for conv in self.convs],dim=1)
        outputs =self.decoder(self.dropout(encoding))
        return outputs 

net = TextCNN(kernel_sizes, nums_channels).to(device)
# net

net.embedding.weight.data.copy_(TEXT.vocab.vectors)    
optimizer = optim.Adam(net.parameters(),lr=1e-4)      
criterion = nn.CrossEntropyLoss().to(device)          
train_x_len = len(train_x)   

#Define training function
def train(net,iterator,optimizer,criterion,train_x_len):
    epoch_loss = 0                           
    epoch_acc = 0                            
    for batch in iterator:
        optimizer.zero_grad()                
        preds = net(batch.utterance)         
        loss = criterion(preds,batch.label)  
        epoch_loss +=loss.item()             
        loss.backward()                      
        optimizer.step()                     
        epoch_acc+=((preds.argmax(axis=1))==batch.label).sum().item()   
    return epoch_loss/(train_x_len),epoch_acc/train_x_len    

n_epoch = 100
acc_plot=[]     
loss_plot=[]    
for epoch in range(n_epoch):
    train_loss,train_acc = train(net,train_iter,optimizer,criterion,train_x_len)
    acc_plot.append(train_acc)
    loss_plot.append(train_loss)
    if (epoch+1)%10==0:
        print('epoch: %d \t loss: %.4f \t train_acc: %.4f'%(epoch+1,train_loss,train_acc))
        
plt.figure(figsize =(10,5),dpi=80)
plt.plot(acc_plot,label='train_acc')
plt.plot(loss_plot,color='coral',label='train_loss')
plt.legend(loc = 0)
plt.grid(True,linestyle = '--',alpha=1)
plt.xlabel('epoch',fontsize = 15)
plt.show()
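If you do split off a validation set as suggested in the summary, a rough sketch of an evaluation function might look like the following. It mirrors train() but without backpropagation; valid_iter and valid_len are hypothetical names for an iterator and example count built from your held-out split:

#Hypothetical evaluation function for a held-out validation split
def evaluate(net, iterator, criterion, data_len):
    net.eval()                               #switch off dropout
    epoch_loss, epoch_acc = 0, 0
    with torch.no_grad():                    #no gradients needed for evaluation
        for batch in iterator:
            preds = net(batch.utterance)
            loss = criterion(preds, batch.label)
            epoch_loss += loss.item()
            epoch_acc += ((preds.argmax(axis=1)) == batch.label).sum().item()
    net.train()                              #back to training mode
    return epoch_loss / data_len, epoch_acc / data_len

#Example call (assumes valid_iter and valid_len exist):
#valid_loss, valid_acc = evaluate(net, valid_iter, criterion, valid_len)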

Topics: Pytorch Deep Learning Convolutional Neural Networks