Task 10: Solving machine translation with Transformers

Posted by MBenefactor on Thu, 30 Sep 2021 21:34:54 +0200

1 Fine-tuning a Transformer model for the translation task

! pip install datasets transformers "sacrebleu>=1.4.12,<2.0.0" sentencepiece
# Select a model checkpoint
model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"

As long as the pretrained Transformer model has a seq2seq head, this notebook can in theory be used with a variety of Transformer checkpoints.

1.1 Loading data

Use the Datasets library to load the data and the corresponding evaluation metric.

from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")

The datasets object itself is a DatasetDict. For the training, validation, and test sets, you only need to use the corresponding key (train, validation, test) to get the data for that split.
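
As a rough picture of that layout, the sketch below mimics it with plain Python dictionaries and lists. The sentences are toy examples made up for illustration, not records from wmt16, and a real DatasetDict holds Dataset objects rather than lists.

```python
# Toy stand-in for the DatasetDict layout: one entry per split,
# each record holding a "translation" dict keyed by language code.
toy_datasets = {
    "train": [
        {"translation": {"en": "Good morning.", "ro": "Bună dimineața."}},
        {"translation": {"en": "Thank you.", "ro": "Mulțumesc."}},
    ],
    "validation": [
        {"translation": {"en": "Good evening.", "ro": "Bună seara."}},
    ],
    "test": [
        {"translation": {"en": "Good night.", "ro": "Noapte bună."}},
    ],
}

# Pick a split by its key, then index into it to get a single record.
first_train_example = toy_datasets["train"][0]
print(first_train_example["translation"]["en"])
print(first_train_example["translation"]["ro"])
```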

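Each example pairs an English sentence (en) with its Romanian translation (ro).

To get a better sense of what the data looks like, the following function displays a few randomly chosen examples from the dataset: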

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Show class labels by name rather than by integer ID
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])
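
The metric is an instance of datasets.Metric. We can check that it works by calling compute on a couple of made-up predictions and references. The predictions are a list of sentences, and the references are a list of lists of sentences, since each prediction may have several references:

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)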

Before feeding the data into the model, we need to preprocess it. The preprocessing tool is the tokenizer: it first splits the input into tokens, then converts the tokens into the corresponding token IDs in the pretrained model's vocabulary, and finally produces the input format the model requires.
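
The three steps can be sketched with a toy tokenizer. This is only an illustration under simplifying assumptions: the vocabulary, the whitespace splitting, and the toy_* function names are all made up here, whereas the real tokenizer uses a learned subword vocabulary.

```python
# Toy vocabulary; a real tokenizer ships a much larger, learned subword vocabulary.
toy_vocab = {"<pad>": 0, "<unk>": 1, "</s>": 2, "hello": 3, "world": 4, "this": 5, "is": 6}

def toy_tokenize(text):
    """Step 1: split text into tokens (plain lowercase whitespace split)."""
    return text.lower().split()

def toy_convert_tokens_to_ids(tokens):
    """Step 2: look up each token's ID, falling back to <unk>."""
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]

def toy_encode(text):
    """Step 3: append the end-of-sequence ID and build the model input dict."""
    input_ids = toy_convert_tokens_to_ids(toy_tokenize(text)) + [toy_vocab["</s>"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("Hello world this is new")
print(encoded)
```

The word "new" is not in the toy vocabulary, so it maps to the <unk> ID, which is also how unknown pieces are typically handled by real tokenizers.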

1.2 Data preprocessing

For data preprocessing, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

We get a tokenizer that corresponds one-to-one to the pretrained model.
When we use the tokenizer for the specified model checkpoint, the vocabulary the model requires (that is, its token vocabulary) is downloaded as well.

from transformers import AutoTokenizer
# sentencepiece needs to be installed: pip install sentencepiece
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "ro-RO"

The tokenizer can preprocess either a single text or a pair of texts. The data it returns matches the input format of the pretrained model.

tokenizer("Hello, this one sentence!")

The token IDs shown above (input_ids) generally vary with the pretrained model, because different pretrained models use different rules during pretraining. However, as long as the tokenizer and the model are loaded from the same checkpoint name, the input produced by the tokenizer will match what the model expects. Refer to this tutorial for more information on preprocessing.

Besides tokenizing a single sentence, we can also tokenize a list of sentences.

tokenizer(["Hello, this one sentence!", "This is another sentence."])
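
To prepare the translation targets, the tokenizer has to be used inside the as_target_tokenizer context manager, so that the text is processed with the target-side settings (for example, the special tokens for the target language):

with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # Print the tokens to take a look at the special tokens
    print('tokens: {}'.format(tokens))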

If you use a T5 checkpoint, you need to pay attention to its special prefix. T5 uses a prefix on the input to tell the model which task to perform. An example of such a prefix is as follows:

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
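
The prefix and truncation logic of preprocess_function can be tried out without downloading a model. The sketch below swaps in a hypothetical stand-in tokenizer (toy_tokenizer, which simply uses word lengths as fake IDs) and a made-up sentence pair, so only the data flow is meaningful, not the IDs themselves.

```python
def toy_tokenizer(texts, max_length):
    """Hypothetical stand-in for a tokenizer: word lengths as fake IDs, truncated."""
    return {"input_ids": [[len(word) for word in text.split()][:max_length] for text in texts]}

def toy_preprocess(examples, prefix="", source_lang="en", target_lang="ro", max_length=4):
    # Same steps as preprocess_function: prefix the sources, tokenize both sides,
    # and store the target IDs under "labels".
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = toy_tokenizer(inputs, max_length=max_length)
    model_inputs["labels"] = toy_tokenizer(targets, max_length=max_length)["input_ids"]
    return model_inputs

# A batch is a dict of lists, here with a single toy sentence pair.
batch = {"translation": [{"en": "Hello friend", "ro": "Salut prietene"}]}
features = toy_preprocess(batch, prefix="translate English to Romanian: ")
print(features)
```

With max_length=4, the prefixed input is cut after four words, which shows that the prefix counts toward the length budget.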

Next, we preprocess all the samples in the datasets by using the map function to apply preprocess_function to every sample.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
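
To see why preprocess_function works on lists, here is a rough sketch of what batched mapping does, written over plain lists of dicts. The toy_batched_map and count_source_words helpers and the toy records are assumptions made for illustration; the real map method works on Dataset objects and has many more options.

```python
def toy_batched_map(records, function, batch_size=2):
    """Rough sketch of map(..., batched=True) over a list of dict records."""
    columns = list(records[0].keys())
    output = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        # The function receives a dict of lists (one list per column), not single rows.
        batch = {name: [row[name] for row in chunk] for name in columns}
        new_columns = function(batch)
        # New columns are merged back into each row of the chunk.
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({name: values[i] for name, values in new_columns.items()})
            output.append(merged)
    return output

def count_source_words(examples):
    # Hypothetical preprocessing step: number of English words per example.
    return {"num_en_words": [len(ex["en"].split()) for ex in examples["translation"]]}

toy_records = [
    {"translation": {"en": "Good morning.", "ro": "Bună dimineața."}},
    {"translation": {"en": "Thank you very much.", "ro": "Mulțumesc foarte mult."}},
    {"translation": {"en": "Yes.", "ro": "Da."}},
]
mapped = toy_batched_map(toy_records, count_source_words)
print([row["num_en_words"] for row in mapped])
```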

1.3 Fine-tuning the Transformer model

Now that the data is ready, we need to download and load our pretrained model and then fine-tune it. Since this is a seq2seq task, we need a model class suited to it, so we use the AutoModelForSeq2SeqLM class. As with the tokenizer, the from_pretrained method downloads and loads the model for us. The model is also cached, so it will not be downloaded again on later runs.

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
batch_size = 16
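args = Seq2SeqTrainingArguments(
    "test-translation",  # output directory (placeholder name, choose your own)
    evaluation_strategy = "epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    # Generate translations during evaluation so compute_metrics can decode them
    predict_with_generate=True,
)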

The evaluation_strategy = "epoch" argument above tells the training code to run an evaluation on the validation set at the end of every epoch.

The batch_size used above was defined earlier in the notebook.

Because our dataset is relatively large and Seq2SeqTrainer saves checkpoints regularly, we tell it to keep at most 3 checkpoints (via the save_total_limit argument).
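
The effect of such a limit can be sketched as a simple rotation: each new checkpoint is added, and the oldest ones beyond the limit are dropped. This is a simplified illustration, not the Trainer's actual code; the rotate_checkpoints helper and the step numbers are hypothetical, and details such as protecting the best checkpoint are ignored.

```python
def rotate_checkpoints(saved, new_checkpoint, save_total_limit=3):
    """Add a checkpoint and drop the oldest ones beyond the limit."""
    saved = saved + [new_checkpoint]
    # Keep only the most recent save_total_limit entries.
    return saved[-save_total_limit:]

saved_checkpoints = []
for step in (500, 1000, 1500, 2000, 2500):  # hypothetical save steps
    saved_checkpoints = rotate_checkpoints(saved_checkpoints, f"checkpoint-{step}")
print(saved_checkpoints)
```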

Finally, we need a data collator to batch the processed examples and feed them to the model.

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
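
What a seq2seq collator does to a batch can be sketched in plain Python: every sequence is padded to the longest one in the batch, and the labels are padded with -100 so that the loss ignores those positions. The toy_collate function and the pad ID of 0 are assumptions for illustration; the real collator takes the pad ID from the tokenizer, returns tensors, and does more than this.

```python
def toy_collate(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence in the batch to the longest one (right padding)."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input_len - len(f["input_ids"])
        label_pad = max_label_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        # Labels are padded with -100 so the loss skips those positions.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch

toy_features = [
    {"input_ids": [5, 6, 7, 2], "labels": [8, 9, 2]},
    {"input_ids": [5, 2], "labels": [8, 2]},
]
toy_batch = toy_collate(toy_features)
print(toy_batch)
```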
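
The last thing to define is how to compute the evaluation metric from the model's predictions. We use the metric loaded earlier; the only extra work is decoding the predictions and labels back into text:

import numpy as np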

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
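
Two steps in compute_metrics are easy to misread: swapping -100 back to the pad ID before decoding, and measuring the generated length by counting non-padding tokens. The sketch below reproduces both with plain lists instead of NumPy arrays. The helper names, the toy ID rows, and the pad ID of 0 are assumptions for illustration.

```python
def restore_padding(label_rows, pad_token_id, ignore_index=-100):
    """Swap the ignore index back to the pad ID so the rows can be decoded."""
    return [[pad_token_id if tok == ignore_index else tok for tok in row] for row in label_rows]

def mean_generated_length(pred_rows, pad_token_id):
    """Average number of non-padding tokens per prediction."""
    lengths = [sum(1 for tok in row if tok != pad_token_id) for row in pred_rows]
    return sum(lengths) / len(lengths)

toy_labels = [[8, 9, 2, -100], [8, 2, -100, -100]]
toy_preds = [[8, 9, 2, 0], [8, 9, 9, 2]]
print(restore_padding(toy_labels, pad_token_id=0))
print(mean_generated_length(toy_preds, pad_token_id=0))
```

The replacement is needed because -100 is not a valid token ID, so the tokenizer cannot decode it.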

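Finally, pass the model, the arguments, the data, and the other components to Seq2SeqTrainer:

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Call the train method to start fine-tuning.

trainer.train()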

That wraps up this stage. I now have a rough idea of how deep the NLP pool is. I realize how much I don't know, and I hope to keep getting better.

References

Datawhale natural language processing based on transformers (Introduction to NLP)
