title: TextRCNN close reading and reproduction
date: 2022-03-03 20:16:16
tags:
- paper close reading
- reproduction
- pytorch
Paper: Recurrent Convolutional Neural Networks for Text Classification (on Papers With Code)
Model architecture
A bidirectional recurrent neural network followed by a max-pooling layer. The word embeddings in the paper are pre-trained with word2vec; I could not find the corresponding pre-trained resources.
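As a quick reminder of what the paper proposes (paraphrased from memory, so the notation may differ slightly from the original): each word is represented by its embedding plus a recurrently computed left context and right context, the three pieces are concatenated, passed through a tanh projection, and max-pooled element-wise over positions:

$c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1}))$
$c_r(w_i) = f(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1}))$
$x_i = [c_l(w_i);\, e(w_i);\, c_r(w_i)]$
$y_i^{(2)} = \tanh(W^{(2)} x_i + b^{(2)}), \qquad y^{(3)} = \max_i y_i^{(2)}$

In the reproduction below, the left and right contexts $c_l$ and $c_r$ are approximated by the forward and backward hidden states of a BiLSTM.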
Experiment
Data
The Stanford Sentiment Treebank (SST) dataset mentioned in the paper is used.
First, the raw dataset is processed so that each sentence's text is paired with its sentiment score from sentiment_labels.txt, producing train/test/valid files.
For the data preprocessing, I followed a CSDN blog post. Many thanks to the author~~
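In case that post disappears, here is a minimal sketch of the idea, assuming the standard SST release files (datasetSentences.txt, datasetSplit.txt, dictionary.txt, sentiment_labels.txt). The file names, separators, and split codes are assumptions on my part, not taken from the blog post, so double-check them against the dataset's README.

# Hedged sketch: rebuild train/valid/test files with (sentence, sentiment score) pairs
# from the standard SST release. Paths and formats are assumptions, verify them locally.
import pandas as pd

sentences = pd.read_csv('datasetSentences.txt', sep='\t')      # sentence_index, sentence
splits = pd.read_csv('datasetSplit.txt', sep=',')              # sentence_index, splitset_label
phrases = pd.read_csv('dictionary.txt', sep='|', names=['phrase', 'phrase_id'])
scores = pd.read_csv('sentiment_labels.txt', sep='|')          # phrase ids, sentiment values
scores.columns = ['phrase_id', 'score']

# Attach a score to each full sentence by matching it against the phrase dictionary
df = sentences.merge(splits, on='sentence_index')
df = df.merge(phrases, left_on='sentence', right_on='phrase', how='inner')
df = df.merge(scores, on='phrase_id')

# splitset_label: 1 = train, 2 = test, 3 = dev (as described in the SST README)
for split_id, name in [(1, 'train'), (2, 'test'), (3, 'valid')]:
    part = df[df.splitset_label == split_id]
    part[['sentence', 'score']].to_csv(f'{name}_final.txt', sep='\t',
                                       header=False, index=False)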
Next, write the dataset code:
'''
author: yxr
date: 2022-2-26
introduce: data processing part of the RCNN reproduction
'''
import pandas as pd
import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
# import torchtext
# from torchtext.data import get_tokenizer

pad_size = 64


# Build the word2idx vocabulary from all three splits
def make_dictionary():
    df1 = pd.read_csv('./train_final.txt', header=None, delimiter='\t')
    df2 = pd.read_csv('./test_final.txt', header=None, delimiter='\t')
    df3 = pd.read_csv('./valid_final.txt', header=None, delimiter='\t')
    frame = [df1, df2, df3]
    df = pd.concat(frame, axis=0)
    df.columns = ['sentence', 'label']
    sentences = df.sentence.values
    word2ids = {'[SEP]': 0, '[CLS]': 1, '[PAD]': 2}
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        for token in tokens:
            if token not in word2ids.keys():
                word2ids[token] = len(word2ids)
    return word2ids


# Map a continuous sentiment score in [0, 1] to one of five classes
def label_class(score):
    if 0 <= score <= 0.2:
        tmp_label = 0
    elif 0.2 < score <= 0.4:
        tmp_label = 1
    elif 0.4 < score <= 0.6:
        tmp_label = 2
    elif 0.6 < score <= 0.8:
        tmp_label = 3
    else:
        tmp_label = 4
    return tmp_label


# Turn a split file into (input_ids, labels) tensors plus the raw token lists
def data_process(data_path, word2idx):
    df = pd.read_csv(data_path, header=None, delimiter='\t')
    df.columns = ['sentence', 'label']
    # df = df[:3]
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    sentences = df.sentence.values
    sentence_labels = df.label.values
    input_ids = []
    input_tokens = []
    labels = []
    for sent, ll in zip(sentences, sentence_labels):
        token_id = []
        tmp_tokens = tokenizer.tokenize('[CLS]' + sent)
        # Truncate or pad to pad_size - 1 tokens, then append [SEP]
        if len(tmp_tokens) > pad_size - 1:
            tmp_tokens = tmp_tokens[:pad_size - 1]
        else:
            tmp_tokens = tmp_tokens + ['[PAD]'] * (pad_size - 1 - len(tmp_tokens))
        tmp_tokens.append('[SEP]')
        for token in tmp_tokens:
            token_id.append(word2idx[token])
        tmp_label = label_class(ll)
        input_ids.append(token_id)
        input_tokens.append(tmp_tokens)
        labels.append(tmp_label)
    input_ids = torch.tensor(input_ids)
    labels = torch.tensor(labels)
    # print(input_tokens)  # str data cannot live in a tensor, so tokens are returned as a list
    return TensorDataset(input_ids, labels), input_tokens
The tokenizer here comes from BERT (via transformers), which differs from the PyTorch implementations found on GitHub. Other, much lighter-weight tokenizers would also work.
Steps: build a custom vocabulary dictionary -> tokenize the text -> map tokens to input_ids with the vocabulary -> return a TensorDataset. Note: str data cannot be stored in a tensor, so the raw token lists are returned separately. A usage sketch follows below.
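For reference, the functions above are used roughly like this (a quick sanity check run inside data_preprocess.py, or after importing the two functions; not part of the training script):

# Quick sanity check of the data pipeline defined above
from torch.utils.data import DataLoader

word2idx = make_dictionary()
train_dataset, train_tokens = data_process('./train_final.txt', word2idx)
loader = DataLoader(train_dataset, batch_size=10, shuffle=True)

input_ids, labels = next(iter(loader))
print(input_ids.shape)   # torch.Size([10, 64]) -> [batch_size, pad_size]
print(labels.shape)      # torch.Size([10])
print(train_tokens[0])   # token list of the first sentence, kept outside the tensors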
Model
Next comes the core model, which simply follows the paper.
import torch
import torch.nn as nn

embed_size = 300   # 50
hidden_size = 512  # 100
pad_size = 64      # 50
class_num = 5


class Embedding(nn.Module):
    def __init__(self, vocab_size, is_pretrain=False):
        super(Embedding, self).__init__()
        if is_pretrain:
            # the pre-trained path needs a word_vectors matrix to be passed in; unused here
            # self.embedding = nn.Embedding.from_pretrained(word_vectors)
            self.embedding.weight.requires_grad = False
        else:
            self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, input_ids):
        return self.embedding(input_ids)


class RCNN(nn.Module):
    def __init__(self, vocab_size):
        super(RCNN, self).__init__()
        self.embedding = Embedding(vocab_size=vocab_size)
        # batch_first=True because the inputs are [batch_size, pad_size, embed_size]
        self.bilstm = nn.LSTM(input_size=embed_size, hidden_size=hidden_size,
                              bidirectional=True, batch_first=True)
        self.linear1 = nn.Linear(embed_size + 2 * hidden_size, hidden_size)
        self.tanh = nn.Tanh()
        self.maxpooling = nn.MaxPool1d(pad_size)
        self.linear2 = nn.Linear(hidden_size, class_num)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        # Word embedding: [batch_size, pad_size, embed_size]
        embed = self.embedding(x)
        # Bidirectional LSTM: [batch_size, pad_size, 2 * hidden_size]
        lstm_out, _ = self.bilstm(embed)
        # Concatenate the embedding with the left/right context, then project with tanh
        out = torch.cat((embed, lstm_out), 2)
        out = self.tanh(self.linear1(out))
        # Max-pool over the sequence dimension: [batch_size, hidden_size]
        out = out.permute(0, 2, 1)
        out = self.maxpooling(out).squeeze()
        out = self.linear2(out)
        # Note: nn.CrossEntropyLoss already applies log-softmax internally,
        # so this extra softmax is redundant when training with that loss
        out = self.softmax(out)
        return out
Pay attention to how the LSTM is used here: nn.LSTM expects (seq, batch, feature) input unless batch_first=True is set. Some details of the LSTM implementation are still unclear to me; I will go over the reasoning again tomorrow.
When writing a relatively simple network in PyTorch, the most important thing is to keep track of the tensor dimensions. Most of the building blocks are already implemented, so you mainly just pass parameters.
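For example, pushing a dummy batch through the model is a quick way to confirm the shapes (the vocab_size and batch size here are arbitrary):

# Dimension check with a dummy batch; run alongside the model definition above
import torch

model = RCNN(vocab_size=1000)
dummy_ids = torch.randint(0, 1000, (8, pad_size))   # [batch_size, pad_size]
out = model(dummy_ids)
print(out.shape)   # expected: torch.Size([8, 5]) -> [batch_size, class_num]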
Train
The next step is to wire the data processing and model parts together, feed the data into the model, and start training:
import sys
sys.path.append('./')

import time
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as Optim
import gensim  # only needed for the pre-trained word2vec variant below
from transformers import BertTokenizer
from torch.utils.data import DataLoader, RandomSampler

import data_preprocess
import rcnn_model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Read in the data
print('==Start reading data:===')
word2idx = data_preprocess.make_dictionary()
vocab_size = len(word2idx)
train_dataset, train_tokens = data_preprocess.data_process('./train_final.txt', word2idx)
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=10)
test_dataset, test_tokens = data_preprocess.data_process('./test_final.txt', word2idx)
test_loader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=10)

print('==Start loading model:===')
# Pre-trained embedding variant (removed, see the note below):
# model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# word_vectors = torch.randn([30522, 300])  # initialise the word embedding matrix
# for i in range(0, 30522):
#     token = tokenizer.decode(i)
#     if token in model:
#         word_vectors[i, :] = torch.from_numpy(model[token])

model = rcnn_model.RCNN(vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = Optim.Adadelta(model.parameters(), lr=0.03)

acc = []
for epoch in range(50):
    print('==train===')
    acc1 = []
    accuracy = 0.0
    data_batch_num = 0
    total_loss = 0.0
    t_begin = time.time()
    for batch in train_loader:
        input_ids = batch[0]
        labels = batch[1]
        output = model(input_ids)
        out_class = torch.argmax(output, dim=1)
        loss = criterion(output, labels)
        # Parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Accumulate the number of correct predictions
        accuracy += int(sum(out_class == labels))
        data_batch_num += len(labels)
    accuracy = float(accuracy / len(train_tokens))
    print('===acc:%.3f' % accuracy, '==time:', time.time() - t_begin)
    acc1.append('{:.4f}'.format(accuracy))

    print('===test===')
    accuracy = 0.0
    for batch in test_loader:
        input_ids = batch[0]
        labels = batch[1]
        with torch.no_grad():
            output = model(input_ids)
            out_class = torch.argmax(output, dim=1)
            accuracy += int(sum(out_class == labels))
    accuracy = float(accuracy / len(test_tokens))
    print('===acc:%.3f' % accuracy)
    acc1.append('{:.4f}'.format(accuracy))
    acc.append(acc1)

# Two values are recorded per epoch, so two columns
acc = pd.DataFrame(acc, columns=['train', 'test'])
acc.to_csv('accuracy.csv', index=False)
The embedding originally used Google's pre-trained word2vec matrix (look up each word to get its embedding vector), but the results were mediocre. I later removed it and switched to nn.Embedding(vocab_size, embed_size), letting the word embedding vectors be adjusted during training. A sketch of the pre-trained variant follows below.
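For the record, the pre-trained variant looked roughly like this: build a weight matrix indexed by the custom word2idx and hand it to nn.Embedding.from_pretrained. This is a sketch reconstructed from the commented-out lines above; the fallback to random vectors for out-of-vocabulary words is my assumption.

# Sketch: initialise the embedding matrix from the 300-d Google News word2vec vectors,
# keyed by the custom word2idx vocabulary; unknown words keep random vectors.
import gensim
import torch
import torch.nn as nn

import data_preprocess

word2idx = data_preprocess.make_dictionary()
w2v = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

word_vectors = torch.randn(len(word2idx), 300)
for token, idx in word2idx.items():
    if token in w2v:
        word_vectors[idx] = torch.from_numpy(w2v[token].copy())

embedding = nn.Embedding.from_pretrained(word_vectors, freeze=True)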
Result
According to the README that comes with the data, the dataset has five classes. The test accuracy is about 37%, which does not reach the 40%+ reported in the paper.
The reason may be the word embedding part, but I'm not sure. I also feel that bucketing the continuous scores into categories loses some information, so I plan to drop the classification setup: change the model output to a value in [0, 1] and then judge accuracy from the predicted scores. A sketch of that variant follows below.
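A minimal sketch of that regression variant, assuming the RCNN class above and raw scores kept as float labels (untested, just the idea): replace the 5-way classification head with a single sigmoid output, train with MSE against the raw scores, and bucket predictions only when measuring accuracy.

# Hedged sketch: regression head predicting a score in [0, 1] instead of 5 classes
import torch
import torch.nn as nn

from rcnn_model import RCNN, hidden_size


class RCNNRegression(RCNN):
    def __init__(self, vocab_size):
        super().__init__(vocab_size)
        self.linear2 = nn.Linear(hidden_size, 1)   # single score instead of class_num logits

    def forward(self, x):
        embed = self.embedding(x)
        lstm_out, _ = self.bilstm(embed)
        out = self.tanh(self.linear1(torch.cat((embed, lstm_out), 2)))
        out = self.maxpooling(out.permute(0, 2, 1)).squeeze(-1)
        return torch.sigmoid(self.linear2(out)).squeeze(-1)   # [batch_size] scores in [0, 1]


# Training would then use the raw scores with MSE, e.g.:
# criterion = nn.MSELoss()
# loss = criterion(model(input_ids), raw_scores)
# and accuracy can still be computed by bucketing predictions with label_class().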
It also seems that the quality of the dataset is not very high; the labels may not have been annotated manually.