Text generation in GPT-2

Posted by stanleybb on Fri, 22 Oct 2021 08:09:26 +0200

Implementation of the BPE algorithm, as given in the original BPE paper:

import re
import collections


def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq  # Calculate byte pair occurrence frequency
    return pairs


def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))  # Escape characters in the byte pair that could be interpreted as regex operators
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')  # Match the byte pair only when it is not preceded or followed by a non-whitespace character
    for word in v_in:
        w_out = p.sub(''.join(pair), word)  # Merge qualified byte pairs
        v_out[w_out] = v_in[word]
    return v_out

The above defines the core BPE operations. Next, run a few merges on a toy vocabulary:

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # Select the byte pair with the highest frequency
    vocab = merge_vocab(best, vocab)
    print(best)
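
To see which subword units those merges produce, here is a small follow-up sketch using the vocab left behind by the loop above:

# A minimal sketch: collect the subword symbols and their frequencies
# from the merged vocabulary produced by the loop above.
token_freqs = collections.defaultdict(int)
for word, freq in vocab.items():
    for symbol in word.split():
        token_freqs[symbol] += freq
print(dict(token_freqs))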

Code test

Source of this walkthrough: https://www.cnblogs.com/wwj99/p/12503545.html
Below, we start from the pre-trained model, train it again (fine-tune), and check the code step by step in the order it runs.

import random

def select_top_k(predictions, k=10):
    # Randomly sample one token index from the k highest-scoring candidates
    # at the last position, so the generated text is not always the argmax
    predicted_index = random.choice(
        predictions[0, -1, :].sort(descending=True)[1][:k]).item()
    return predicted_index

Following the introduction of the GPT-2 model, we will use the GPT2Tokenizer() and GPT2LMHeadModel() classes from the pytorch_transformers library to see, in practice, how well the pre-trained GPT-2 predicts the next word. First, pytorch_transformers needs to be installed.
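
A minimal install command, assuming the same notebook-style environment as the shell commands used later:

!pip install pytorch_transformers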

Using the pytorch_transformers library, first prepare the example input for the model: create a tokenizer object with GPT2Tokenizer() and use it to encode the original sentence.

import torch
from pytorch_transformers import GPT2Tokenizer

import logging
logging.basicConfig(level=logging.INFO)

# Load the tokenizer from the pre-trained GPT-2 model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Use GPT2Tokenizer to encode the input
text = "Yesterday, a man named Jack said he saw an alien,"
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor.shape

!wget -nc "https://labfile.oss.aliyuncs.com/courses/1372/gpt2-shiyanlou.zip"
!unzip -o "gpt2-shiyanlou.zip"

from pytorch_transformers import GPT2LMHeadModel

# Read GPT-2 pre training model
model = GPT2LMHeadModel.from_pretrained("./")
model.eval()


Out of curiosity about what the downloaded pre-trained model contains, let's take a look.
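
A minimal way to look, assuming the model files were unzipped into the current working directory as above:

import os

# List the files unzipped from gpt2-shiyanlou.zip into the current directory
sorted(os.listdir('./'))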

Now test text generation with the short text above:

total_predicted_text = text
n = 100  # Number of cycles in the prediction process
for _ in range(n):
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    predicted_index = select_top_k(predictions, k=10)
    predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
    total_predicted_text += tokenizer.decode(predicted_index)

    if '<|endoftext|>' in total_predicted_text:
        # If the end of text flag appears, the text generation ends
        break

    indexed_tokens += [predicted_index]
    tokens_tensor = torch.tensor([indexed_tokens])

print(total_predicted_text)

After running, look at the text the model generated. At a glance it reads like normal text, but a closer look reveals logical problems in the sentences, a shortcoming left for researchers to address.

Next, we will fine-tune GPT-2 on some play scripts. Since the open-source pre-trained GPT-2 parameters released by the OpenAI team were obtained by pre-training on English datasets, a Chinese dataset could be used for fine-tuning, but it would take a great deal of data and time to get good results. We therefore fine-tune on an English dataset, which better shows what the GPT-2 model can do.

First, download the training dataset. Here Shakespeare's play Romeo and Juliet is used as the training sample. The dataset has been downloaded in advance and placed on the Lanqiao (Blue Bridge) cloud course server; you can fetch it with the following commands.

!wget -nc "https://labfile.oss.aliyuncs.com/courses/1372/romeo_and_juliet.zip"
!unzip -o "romeo_and_juliet.zip"


The dataset, once read in, is a single string.
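
A minimal sketch of reading it in, assuming the unzipped archive contains a single plain-text file (the filename below is a guess and may need adjusting to match the archive's actual contents):

# Read the play into one string (hypothetical filename; check the unzipped archive)
with open('./romeo_and_juliet.txt', 'r') as f:
    dataset = f.read()
len(dataset)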

Preprocess the training set: encode it and split it into fixed-length segments.

indexed_text = tokenizer.encode(dataset)
del(dataset)

dataset_cut = []
for i in range(len(indexed_text)//512):
    # Split the token sequence into segments of 512 tokens each
    dataset_cut.append(indexed_text[i*512:i*512+512])
del(indexed_text)

dataset_tensor = torch.tensor(dataset_cut)
dataset_tensor.shape

Here I ran these steps separately to check them.

First, the tokenizer encodes the text: it produces 41669 tokens in total, of which only 81 * 512 = 41472 are kept, so the trailing tokens are lost. dataset_cut is then converted to a tensor in PyTorch format.

Checking the data format raises a question: the text does not divide evenly into 512-token segments, so what about the remainder at the end? In this walkthrough it is simply discarded.
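
One possible way to keep that remainder instead, as a minimal sketch: it would have to run before the del(indexed_text) above, and it pads the tail up to a full segment with GPT-2's end-of-text token (the attribute access below assumes GPT2Tokenizer exposes its vocabulary as tokenizer.encoder).

# Optional variant (run before del(indexed_text)): keep the leftover tokens
# by padding them up to a full 512-token segment with the end-of-text token.
remainder = indexed_text[(len(indexed_text) // 512) * 512:]
if remainder:
    eos_id = tokenizer.encoder['<|endoftext|>']  # assumed vocab-dict attribute of GPT2Tokenizer
    dataset_cut.append(remainder + [eos_id] * (512 - len(remainder)))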

Here we use TensorDataset() provided by PyTorch to build the training dataset, and DataLoader() to build the training data iterator.

from torch.utils.data import DataLoader, TensorDataset

# Build the dataset and data iterator, with batch_size set to 2
train_set = TensorDataset(dataset_tensor,
                          dataset_tensor)  # The labels are the same as the input data
train_loader = DataLoader(dataset=train_set,
                          batch_size=2,
                          shuffle=False)
train_loader

In this training set the input equals the target: GPT2LMHeadModel shifts the labels internally, so feeding the same sequence as both data and label trains next-token prediction.
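
A quick sanity check of that, as a small sketch:

# Peek at one batch: input and label are identical tensors of shape [2, 512]
data, target = next(iter(train_loader))
print(data.shape, torch.equal(data, target))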

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Start training.

from torch import nn
from torch.autograd import Variable
import time

pre = time.time()

epoch = 30  # Cycle learning 30 times

model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Define optimizer

for i in range(epoch):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data).to(device), Variable(
            target).to(device)

        optimizer.zero_grad()

        loss, logits, _ = model(data, labels=target)

        total_loss += loss.item()  # Accumulate the scalar loss value rather than the graph-attached tensor

        loss.backward()
        optimizer.step()

        if batch_idx == len(train_loader)-1:
            # Output the results at the end of each Epoch
            print('average loss:', total_loss/len(train_loader))

print('Training time:', time.time()-pre)
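
Optionally, the fine-tuned weights can be saved for later reuse. A minimal sketch, assuming a writable output directory (the directory name here is arbitrary):

import os

# Save the fine-tuned model weights and config to a new directory (hypothetical path)
os.makedirs('./gpt2-finetuned', exist_ok=True)
model.save_pretrained('./gpt2-finetuned')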

text = "From fairest creatures we desire"  # You can also enter different English texts here
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])
model.eval()
total_predicted_text = text

# Make the trained model predict 500 times
for _ in range(500):
    tokens_tensor = tokens_tensor.to(device)

    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    predicted_index = select_top_k(predictions, k=10)

    predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
    total_predicted_text += tokenizer.decode(predicted_index)
    if '<|endoftext|>' in total_predicted_text:
        # If the end of text flag appears, the text generation ends
        break

    indexed_tokens += [predicted_index]

    if len(indexed_tokens) > 1023:
        # The longest input length of the model is 1024. If the length is too long, it will be truncated
        indexed_tokens = indexed_tokens[-1023:]

    tokens_tensor = torch.tensor([indexed_tokens])

print(total_predicted_text)

Topics: AI