1 - Text classification + hyperparameter search

Posted by wendu on Sat, 18 Dec 2021 23:53:55 +0100

If you are opening this notebook locally, please make sure the dependencies below are installed.
You can also find the multi-GPU distributed training version of this notebook here.

pip install transformers datasets

Fine-tuning a pretrained model for text classification

We'll show how to use models from the 🤗 Transformers library to solve text classification tasks from the GLUE Benchmark.

[Figure: text classification overview — https://github.com/huggingface/notebooks/blob/master/examples/images/text_classification.png?raw=1 ]

The GLUE Benchmark contains nine sentence-level classification tasks:

  • CoLA (Corpus of Linguistic Acceptability): determine whether a sentence is grammatically acceptable.
  • MNLI (Multi-Genre Natural Language Inference): given a premise, determine whether a hypothesis sentence is an entailment, a contradiction, or neutral with respect to it.
  • MRPC (Microsoft Research Paraphrase Corpus): determine whether two sentences are paraphrases of each other.
  • QNLI (Question-answering Natural Language Inference): determine whether the second sentence contains the answer to the question in the first.
  • QQP (Quora Question Pairs): determine whether two questions are semantically equivalent.
  • RTE (Recognizing Textual Entailment): determine whether a sentence entails a given hypothesis.
  • SST-2 (Stanford Sentiment Treebank): determine whether the sentiment of a sentence is positive or negative.
  • STS-B (Semantic Textual Similarity Benchmark): rate the similarity of two sentences on a scale from 1 to 5.
  • WNLI (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun and a sentence with that pronoun replaced are entailed.

For each of these tasks, we will show how to load the data with the 🤗 Datasets library and fine-tune a pretrained model with the Trainer API from 🤗 Transformers.

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In principle, this notebook can run with any model checkpoint from the Model Hub and solve any of these text classification tasks.

If your task is different, this notebook can be reused with only small changes. You should also adjust the batch size used for fine-tuning to your GPU memory to avoid out-of-memory errors (a quick memory check is sketched after the next cell).

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
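
If you're unsure how much memory your GPU has, here is a quick check (a sketch assuming PyTorch with CUDA; this snippet is illustrative and not part of the original notebook):

import torch

# Print the name and total memory of the first GPU, if one is available.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No GPU detected; training will run on the CPU.")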

Load data

We will use the 🤗 Datasets library to load the data and the corresponding evaluation metric. That only requires the load_dataset and load_metric functions.

from datasets import load_dataset, load_metric

Except for mnli-mm, every task can be loaded directly by its name. The data is cached automatically after the first download.

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

The loaded object is a DatasetDict data structure: to get the training, validation, or test split, just index it with the corresponding key (train, validation, test).

dataset
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

Given a split key (train, validation, or test) and an index, you can look at individual examples.

dataset["train"][0]
{'idx': 0,
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

To get a better sense of what the data looks like, the following function picks a few random examples from the dataset.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample unique random indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Replace class-label IDs with their human-readable names.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])
   sentence                                                                                                         label         idx
0  The more I talk to Joe, the less about linguistics I am inclined to think Sally has taught him to appreciate.   acceptable    196
1  Have in our class the kids arrived safely?                                                                      unacceptable  3748
2  I gave Mary a book.                                                                                             acceptable    5302
3  Every student, who attended the party, had a good time.                                                         unacceptable  4944
4  Bill pounded the metal flat.                                                                                    acceptable    2178
5  It bit me on the leg.                                                                                           acceptable    5908
6  The boys were made a good mother by Aunt Mary.                                                                  unacceptable  736
7  More of a man is here.                                                                                          unacceptable  5403
8  My mother baked me a birthday cake.                                                                             acceptable    3761
9  Gregory appears to have wanted to be loyal to the company.                                                      acceptable    4334

The evaluation metric is an instance of datasets.Metric:

metric

Call the metric's compute method with predictions and references to get the metric value:

import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
{'matthews_correlation': 0.1513518081969605}

Each text classification task has its own metric:

  • CoLA: Matthews correlation coefficient
  • MNLI (matched/mismatched): accuracy
  • MRPC: accuracy and F1
  • QNLI: accuracy
  • QQP: accuracy and F1
  • RTE: accuracy
  • SST-2: accuracy
  • STS-B: Pearson and Spearman correlation
  • WNLI: accuracy

So be sure to match the metric to the task.
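
If you want to double-check which metric a task uses, one sketch is to compute it on fake data and look at the returned keys (note that stsb expects float predictions, since it is a regression task):

import numpy as np
from datasets import load_metric

# Print the metric keys returned for a few GLUE tasks (sketch).
for t in ["cola", "mrpc", "stsb"]:
    m = load_metric("glue", t)
    if t == "stsb":
        preds = refs = np.random.rand(8)                       # regression: floats
    else:
        preds = refs = np.random.randint(0, 2, size=(8,))      # binary labels
    print(t, "->", list(m.compute(predictions=preds, references=refs).keys()))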

Data preprocessing

Before feeding the data to the model, we need to preprocess it. The preprocessing tool is a tokenizer: it first splits the input into tokens, then converts those tokens into the corresponding token IDs of the pretrained model's vocabulary, and finally produces the input format the model expects.

To do this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

  • we get a tokenizer that corresponds one-to-one with the pretrained model;
  • when using the tokenizer for the specified model checkpoint, the tokens vocabulary required by the model is downloaded as well.

The downloaded vocabulary is cached, so it won't be downloaded again on subsequent runs.

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Note: use_fast=True requires the tokenizer to be of type transformers.PreTrainedTokenizerFast, because we rely on some of the fast tokenizers' features (such as multi-threaded processing) during preprocessing. If the model does not have a fast tokenizer, remove this option.

Almost all model checkpoints have a corresponding fast tokenizer. The model-tokenizer correspondence table shows which pretrained models provide one.
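
You can verify that you actually got a fast tokenizer with a minimal check like this (illustrative only):

import transformers

# Should be True for distilbert-base-uncased, which ships with a fast tokenizer.
isinstance(tokenizer, transformers.PreTrainedTokenizerFast)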

A tokenizer can preprocess either a single text or a pair of texts, and its output matches the input format of the pretrained model:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the pretrained model you pick, the tokenizer returns different fields; tokenizers and pretrained models correspond one-to-one. More information can be found here.
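
To see what the token IDs stand for, you can decode them back; for distilbert-base-uncased the result should look roughly like the comment below (a quick sanity check, not part of the original notebook):

ids = tokenizer("Hello, this one sentence!", "And this sentence goes with it.")["input_ids"]
tokenizer.decode(ids)
# e.g. "[CLS] hello, this one sentence! [SEP] and this sentence goes with it. [SEP]"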

To preprocess our data, we need to know which column(s) hold the sentence(s) for each task, so we define the following dict:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Check the data format:

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
Sentence: Our friends won't buy this analysis, let alone the next one we propose.

We can then wrap the preprocessing in a function:

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

The preprocessing function can handle a single sample or several samples at once. If the input contains multiple samples, it returns a list of lists:

preprocess_function(dataset['train'][:5])
{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Next, apply the preprocessing function to all the samples in the dataset using the map method; it is applied to every element of every split:

encoded_dataset = dataset.map(preprocess_function, batched=True)

Even better, the results are automatically cached, so the next run avoids recomputation (but beware: the cache can trip you up if you change the input!). The 🤗 Datasets library detects changes in the arguments passed to map and reuses the cache when nothing has changed; if you keep the arguments identical but want the input re-processed, pass load_from_cache_file=False to bypass the cache. Also, the batched=True flag used above takes advantage of the fast tokenizer, which processes a batch of inputs in parallel across multiple threads.
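
For example, a sketch of forcing re-processing with the parameter mentioned above:

# Ignore any cached results and re-run the preprocessing function.
encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)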

Fine-tuning the pretrained model

Now that the data is ready, we download and load the pretrained model and fine-tune it. Since this is a sequence classification task, we use the AutoModelForSequenceClassification class. As with the tokenizer, the from_pretrained method downloads, loads, and caches the model, so it is not downloaded twice.

Note that STS-B is a regression problem and MNLI is a 3-way classification problem:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Since our fine-tuning task is text classification while we loaded a pretrained language model, a warning tells us that some weights were discarded when loading the model (the pretrained language-modeling head is dropped, and a randomly initialized classification head is added in its place).

To build a Trainer, we need a few more ingredients, the most important being the training configuration TrainingArguments, which contains all the attributes that customize the training.

metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

The evaluation_strategy = "epoch" parameter above tells the Trainer to run an evaluation at the end of every epoch.

The batch_size was defined earlier in the notebook.

Finally, since different tasks use different evaluation metrics, we define a function that computes the metric for the current task:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        # Classification: pick the class with the highest logit.
        predictions = np.argmax(predictions, axis=1)
    else:
        # STS-B is a regression task: the model outputs a single score.
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

Then we pass everything to the Trainer:

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Start training:

trainer.train()
The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2675




[2675/2675 02:49, Epoch 5/5]

Epoch   Training Loss   Validation Loss   Matthews Correlation
1       0.525400        0.520955          0.409248
2       0.351600        0.570341          0.477499
3       0.236100        0.622785          0.499872
4       0.166300        0.806475          0.491623
5       0.125700        0.882225          0.513900

Evaluation after training:

trainer.evaluate()
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16


[66/66 00:00]

{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

To see how your model fared you can compare it to the GLUE Benchmark leaderboard.

Hyperparameter search

The Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.

Install the dependencies by running the following two lines:

! pip install optuna
! pip install ray[tune]

During hyperparameter search, the Trainer trains multiple models, so we pass in a model_init function instead of a model, allowing the Trainer to reinitialize the model for each trial:

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Then instantiate the Trainer much as before, passing model_init:

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


Now call the hyperparameter_search method. Note that this process can take a long time; it often helps to run the search on a subset of the data first and then do the full training with the best hyperparameters found. For example, you can search on 1/10 of the training data (a sketch follows the next cell):
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
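
A sketch of the subset variant, using the shard method from 🤗 Datasets (the index/num_shards values here are illustrative):

# Search on 1/10 of the training data, then restore the full set for final training.
trainer.train_dataset = encoded_dataset["train"].shard(index=0, num_shards=10)
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
trainer.train_dataset = encoded_dataset["train"]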

hyperparameter_search returns the parameters of the best run found:

best_run

Set the trainer to the best hyperparameters found and train:

# Overwrite the training arguments with the best hyperparameters found.
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Finally, don't forget to check how to upload your model to the 🤗 Model Hub: https://huggingface.co/transformers/model_sharing.html
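
As a minimal sketch (assuming you are authenticated via huggingface-cli login; the repository name below is hypothetical):

# Upload the fine-tuned model and tokenizer to the Hub (sketch).
model.push_to_hub("my-finetuned-cola-model")
tokenizer.push_to_hub("my-finetuned-cola-model")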

Topics: Deep Learning NLP