textRCNN intensive reading and reproduction

Posted by gooney0 on Thu, 03 Mar 2022 14:36:24 +0100

title: textRCNN intensive reading and reproduction
date: 2022-03-03 20:16:16
tags:

  • paper intensive reading
  • Reproduction
  • pytorch

Paper link: Recurrent Convolutional Neural Networks for Text Classification | Papers With Code

Model architecture

A bidirectional recurrent neural network followed by a max-pooling layer. The embeddings in the paper are pre-trained with word2vec, but I could not find the corresponding resources.
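From memory, the paper's formulation is roughly the following sketch, with e(w_i) the word embedding and c_l, c_r the left and right context vectors produced by the recurrent part:

$$
\begin{aligned}
c_l(w_i) &= f\big(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\big) \\
c_r(w_i) &= f\big(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\big) \\
x_i &= [c_l(w_i);\; e(w_i);\; c_r(w_i)] \\
y_i^{(2)} &= \tanh\big(W^{(2)} x_i + b^{(2)}\big) \\
y^{(3)} &= \max_i \; y_i^{(2)} \quad \text{(element-wise max pooling over positions)}
\end{aligned}
$$

In the reproduction below, the left/right recurrent contexts are approximated by a bidirectional LSTM, which is a common simplification.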

Experiment

Data

Using the Stanford Sentiment Treebank (SST) dataset mentioned in the paper.

First, the raw dataset is processed to pair each sentence's text with its sentiment value from the sentiment_labels.txt file.

For the data preprocessing, refer to this CSDN blog post. Thank you, thank you~~
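For reference, here is a minimal sketch of that preprocessing, assuming the standard stanfordSentimentTreebank layout (datasetSentences.txt, datasetSplit.txt, dictionary.txt, sentiment_labels.txt); the file names, separators and split ids are assumptions about the raw SST release, not part of the original post. The output matches the tab-separated sentence/score files read below.

# Sketch: join sentence text -> phrase id -> sentiment score -> train/test/dev split
import pandas as pd

sentences = pd.read_csv('stanfordSentimentTreebank/datasetSentences.txt', sep='\t')
split = pd.read_csv('stanfordSentimentTreebank/datasetSplit.txt', sep=',')
dictionary = pd.read_csv('stanfordSentimentTreebank/dictionary.txt', sep='|',
                         header=None, names=['sentence', 'phrase_id'])
labels = pd.read_csv('stanfordSentimentTreebank/sentiment_labels.txt', sep='|')
labels.columns = ['phrase_id', 'score']

df = sentences.merge(split, on='sentence_index') \
              .merge(dictionary, on='sentence') \
              .merge(labels, on='phrase_id')

# splitset_label: 1 = train, 2 = test, 3 = dev; write tab-separated
# "sentence\tscore" files in the format data_process() below expects
for split_id, name in [(1, 'train_final.txt'), (2, 'test_final.txt'), (3, 'valid_final.txt')]:
    part = df[df.splitset_label == split_id]
    part[['sentence', 'score']].to_csv(name, sep='\t', header=False, index=False)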

Next, write the dataset code:

'''
author: yxr
date: 2022-2-26
introduce: data processing part of the RCNN reproduction
'''
import pandas as pd
import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

# import torchtext
# from torchtext.data import get_tokenizer

pad_size = 64
# Build the word2idx vocabulary over the train/test/valid files
def make_dictionary():
    df1 = pd.read_csv('./train_final.txt', header=None, delimiter='\t')
    df2 = pd.read_csv('./test_final.txt', header=None, delimiter='\t')
    df3 = pd.read_csv('./valid_final.txt', header=None, delimiter='\t')
    frame = [df1, df2, df3]
    df = pd.concat(frame, axis=0)
    df.columns = ['sentence', 'label']
    sentences = df.sentence.values
    word2ids = {'[SEP]': 0, '[CLS]': 1, '[PAD]': 2}
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        for token in tokens:
            if token not in word2ids.keys():
                word2ids[token] = len(word2ids)
    return word2ids


# Map a continuous sentiment score in [0, 1] to one of five classes
def label_class(score):
    if score >= 0 and score <= 0.2:
        tmp_label = 0
    elif score > 0.2 and score <= 0.4:
        tmp_label = 1
    elif score > 0.4 and score <= 0.6:
        tmp_label = 2
    elif score > 0.6 and score <= 0.8:
        tmp_label = 3
    else:
        tmp_label = 4
    return tmp_label


# Tokenize each sentence, pad/truncate to pad_size, and map tokens to ids via word2idx
def data_process(data_path, word2idx):
    df = pd.read_csv(data_path, header=None, delimiter='\t')
    df.columns = ['sentence', 'label']
    # df = df[:3]
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

    sentences = df.sentence.values
    sentence_labels = df.label.values

    input_ids = []
    input_tokens = []
    labels = []
    for sent, ll in zip(sentences, sentence_labels):
        token_id = []
        tmp_tokens = tokenizer.tokenize('[CLS]' + sent)
        if len(tmp_tokens) > pad_size - 1:
            tmp_tokens = tmp_tokens[:pad_size - 1]
        else:
            tmp_tokens = tmp_tokens + ['[PAD]'] * (pad_size - 1 - len(tmp_tokens))
        tmp_tokens.append('[SEP]')
        for token in tmp_tokens:
            token_id.append(word2idx[token])
        tmp_label = label_class(ll)
        input_ids.append(token_id)
        input_tokens.append(tmp_tokens)
        labels.append(tmp_label)
    input_ids = torch.tensor(input_ids)
    labels = torch.tensor(labels)
    # print(input_tokens)   # No string tensor
    return TensorDataset(input_ids, labels), input_tokens

The tokenizer here uses BERT's tokenizer from the transformers library, which differs from the PyTorch implementation on GitHub. Other tokenizers could be used instead, and they would be much lighter.
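For example, a much lighter option is torchtext's basic_english tokenizer (the commented-out import above); a minimal sketch, with the example sentence chosen arbitrarily:

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')   # lowercase + split on whitespace/punctuation
print(tokenizer('The Rock is destined to be the new Conan.'))
# roughly: ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', 'new', 'conan', '.']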

Steps: build a custom vocabulary dictionary -> tokenize the text -> map tokens to input_ids with the vocabulary -> return a TensorDataset. Note: str data cannot be stored in a tensor, so the raw tokens are returned separately.

Model

Next comes the core model part, which essentially just follows the paper.

import torch
import torch.nn as nn

embed_size = 300  # 50
hidden_size = 512  # 100
pad_size = 64   # 50
class_num = 5


class Embedding(nn.Module):
    def __init__(self, vocab_size, is_pretrain=False, word_vectors=None):
        super(Embedding, self).__init__()
        if is_pretrain:
            # word_vectors: a [vocab_size, embed_size] tensor of pretrained vectors
            self.embedding = nn.Embedding.from_pretrained(word_vectors)
            self.embedding.weight.requires_grad = False
        else:
            self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, input_ids):
        return self.embedding(input_ids)


class RCNN(nn.Module):
    def __init__(self, vocab_size):
        super(RCNN, self).__init__()
        self.embedding = Embedding(vocab_size=vocab_size)
        # batch_first=True so the LSTM runs over the pad_size dimension rather than the batch
        self.bilstm = nn.LSTM(input_size=embed_size, hidden_size=hidden_size,
                              bidirectional=True, batch_first=True)
        self.linear1 = nn.Linear(embed_size + 2 * hidden_size, hidden_size)
        self.tanh = nn.Tanh()
        self.maxpooling = nn.MaxPool1d(pad_size)
        self.linear2 = nn.Linear(hidden_size, class_num)
        # Note: nn.CrossEntropyLoss already applies log-softmax internally, so this
        # explicit softmax is redundant for the loss (it does not change the argmax)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        # Word embedding [batch_size * pad_size * embedding_size]
        # print('input.shape:', x.shape)
        embed = self.embedding(x)
        # print('embed.shape:', embed.shape)
        # Bidirectional LSTM [batch_size * pad_size * 2 * hidden_size]
        lstm_out, _ = self.bilstm(embed)
        # print('lstm.shape:', lstm_out.shape)
        out = torch.cat((embed, lstm_out), 2)
        out = self.tanh(self.linear1(out))
        out = out.permute(0, 2, 1)                 # [batch_size, hidden_size, pad_size]
        out = self.maxpooling(out).squeeze(-1)     # max over positions -> [batch_size, hidden_size]
        # print('out.shape:', out.shape)
        out = self.linear2(out)
        out = self.softmax(out)

        return out

Pay attention to how the LSTM is used here. Some details of the LSTM implementation are still unclear to me; I'll go over the reasoning again tomorrow.
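As a reminder, a minimal sketch of the nn.LSTM input/output layout (the shapes below assume the hyperparameters used above):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=512, bidirectional=True, batch_first=True)
x = torch.randn(10, 64, 300)            # [batch, pad_size, embed_size]
output, (h_n, c_n) = lstm(x)
print(output.shape)                     # torch.Size([10, 64, 1024]) -> 2 * hidden_size
print(h_n.shape)                        # torch.Size([2, 10, 512])   -> one final state per direction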

When writing a relatively simple network in PyTorch, the most important thing is to keep track of the tensor dimensions. Most of the building blocks are already implemented, so you only need to pass the right parameters.
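A quick way to verify the dimension flow is to push a dummy batch through the model; a minimal sketch (the vocabulary size of 1000 is arbitrary):

model = RCNN(vocab_size=1000)
dummy_ids = torch.randint(0, 1000, (8, pad_size))   # [batch=8, pad_size=64]
print(model(dummy_ids).shape)                       # torch.Size([8, 5]) -> class_num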

Train

The next step is to combine the data processing and model code above, feed the data into the model, and start training:

import sys
sys.path.append('./')
# ==========Get embedding matrix
import gensim
# Load the trained model
import torch
import torch.nn as nn
import torch.optim as Optim
import data_preprocess
from transformers import BertTokenizer

from torch.utils.data import DataLoader, RandomSampler
import rcnn_model
import time
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# read in data
print('==Start reading data:===')
word2idx = data_preprocess.make_dictionary()
vocab_size = len(word2idx)
train_dataset, train_tokens = data_preprocess.data_process('./train_final.txt', word2idx)
train_loader = DataLoader(train_dataset,
                          sampler=RandomSampler(train_dataset),
                          batch_size=10)

test_dataset, test_tokens = data_preprocess.data_process('./test_final.txt', word2idx)
test_loader = DataLoader(test_dataset,
                         sampler=RandomSampler(test_dataset),
                         batch_size=10)
print('==Start loading model:===')
# model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# word_vectors = torch.randn([30522, 300])  # Initialize the embedding matrix of words
# for i in range(0, 30522):
#     token = tokenizer.decode(i)
#     if token in model:
#         word_vectors[i, :] = torch.from_numpy(model[token])

model = rcnn_model.RCNN(vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = Optim.Adadelta(model.parameters(), lr=0.03)

acc = []

for epoch in range(50):
    print('==train===')
    acc1 = []
    accuracy = 0.0
    data_batch_num = 0
    total_loss = 0.0
    t_begin = time.time()
    for batch in train_loader:
        input_ids = batch[0]
        labels = batch[1]
        output = model(input_ids)
        out_class = torch.argmax(output, dim=1)
        loss = criterion(output, labels)
        # Parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Accuracy
        accuracy += int(sum(out_class == labels))
        data_batch_num += len(labels)

    accuracy = float(accuracy / len(train_tokens))
    print('===acc:%.3f' % accuracy, '==time:', time.time() - t_begin)
    acc1.append('{:.4f}'.format(accuracy))
    print('===test===')
    accuracy = 0.0
    for batch in test_loader:
        input_ids = batch[0]
        labels = batch[1]
        with torch.no_grad():
            output = model(input_ids)
        out_class = torch.argmax(output, dim=1)
        accuracy += int(sum(out_class == labels))
    accuracy = float(accuracy / len(test_tokens))
    print('===acc:%.3f' % accuracy)
    acc1.append('{:.4f}'.format(accuracy))
    acc.append(acc1)

acc = pd.DataFrame(acc, columns=['train', 'test'])
acc.to_csv('accuracy.csv', index=False)

The embedding originally used Google's pre-trained word2vec matrix (which maps a word to its embedding vector), but the results were mediocre. It was later removed and replaced with nn.Embedding(vocab_size, embed_size), so the word embeddings are adjusted during training.
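For completeness, a hedged sketch of how the pretrained route could be wired up, assuming the GoogleNews word2vec binary and the word2idx vocabulary built earlier (the original code dropped this in favour of a trainable embedding):

import gensim
import torch
import torch.nn as nn

w2v = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

word_vectors = torch.randn(len(word2idx), 300)      # random init for words missing from word2vec
for token, idx in word2idx.items():
    if token in w2v:
        word_vectors[idx] = torch.from_numpy(w2v[token].copy())

# freeze=True reproduces the "fixed pretrained embedding" variant
embedding = nn.Embedding.from_pretrained(word_vectors, freeze=True)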

Result

According to the README that comes with the data, the dataset has five classes, and the test accuracy is about 37%. It did not reach the 40%+ reported in the paper.

The word-embedding part may be the reason, but that is still uncertain. I also feel that bucketing the continuous scores into discrete categories loses some information, so I plan to drop the classification: change the model output to a value in [0, 1] and judge accuracy from that prediction.
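A hedged sketch of that regression variant (RCNNRegressor is a hypothetical name, not part of the original code; it would be trained with nn.MSELoss on the raw scores, and predictions bucketed with label_class() only when measuring accuracy):

# Assumes the RCNN class, hidden_size and pad_size from the model file above
class RCNNRegressor(RCNN):
    def __init__(self, vocab_size):
        super().__init__(vocab_size)
        self.linear2 = nn.Linear(hidden_size, 1)     # one sentiment score instead of 5 classes

    def forward(self, x):
        embed = self.embedding(x)                    # [batch, pad_size, embed_size]
        lstm_out, _ = self.bilstm(embed)             # [batch, pad_size, 2 * hidden_size]
        out = self.tanh(self.linear1(torch.cat((embed, lstm_out), 2)))
        out = self.maxpooling(out.permute(0, 2, 1)).squeeze(-1)
        return torch.sigmoid(self.linear2(out)).squeeze(-1)   # [batch], values in (0, 1)

# Training sketch: criterion = nn.MSELoss() against the raw scores, and accuracy
# compares label_class(prediction) with label_class(score).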

It also seems that the quality of the dataset is not very high; the labels were probably not annotated by hand.

Topics: Machine Learning Pytorch