
GitHub Repository: huggingface/notebooks
Path: blob/main/examples/accelerate_examples/simple_nlp_example.ipynb
Kernel: Python 3

Before we can browse the rest of the notebook, we need to install the dependencies: this example uses datasets and transformers. To use TPUs on Colab, we also need to install torch_xla, and the last line installs accelerate from source since the features we are using are very recent and not released yet.

! pip install datasets transformers evaluate
! pip install cloud-tpu-client==0.10 torch==2.0.0
! pip install https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-2.0-cp310-cp310-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate

Here are all the imports we will need for this notebook.

import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)
from tqdm.auto import tqdm

import datasets
import transformers
WARNING:root:TPU has started up successfully with version pytorch-1.8

This notebook can run with any model checkpoint on the model hub that has a version with a classification head. Here we select bert-base-cased.

model_checkpoint = "bert-base-cased"

The next two sections explain how we load and prepare our data for our model. If you are only interested in seeing how 🤗 Accelerate works, feel free to skip them (but make sure to execute all the cells!)

Load the data

To load the dataset, we use the load_dataset function from 🤗 Datasets. It will download and cache it (so the download won't happen if we restart the notebook).

raw_datasets = load_dataset("glue", "mrpc")
WARNING:datasets.builder:Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)

The raw_datasets object itself is a DatasetDict, which contains one key each for the training, validation and test set (with more keys for the mismatched validation and test sets in the special case of mnli).

raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

To access an actual element, you need to select a split first, then give an index:

raw_datasets["train"][0]
{'idx': 0, 'label': 1, 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])

Preprocess the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:

  • we get a tokenizer that corresponds to the model architecture we want to use,

  • we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default (unless you pass use_fast=False to the call above) it will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, pass use_fast=False to fall back to the Python tokenizer.
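
As a quick illustration, here is a sketch of that fallback (use_fast is an argument of AutoTokenizer.from_pretrained; is_fast lets you check which implementation you got):

# Fall back to the pure-Python ("slow") tokenizer if the Rust-backed one raises an error.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

# Check which implementation is in use (True for a fast tokenizer, False otherwise).
print(tokenizer.is_fast)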

You can directly call this tokenizer on one sentence or a pair of sentences:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 8667, 117, 1142, 1141, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in this tutorial if you're interested.

We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. We also need all of our samples to have the same length (we will train on TPU, which needs fixed shapes, so we won't pad to the maximum length of each batch); this is done with padding="max_length". The max_length argument is used both for the truncation and padding (short inputs are padded to that length and long inputs are truncated to it).

def tokenize_function(examples):
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", max_length=128)
    return outputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

tokenize_function(raw_datasets['train'][:5])
{'input_ids': [[101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 10684, 2599, 9717, 1161, 2205, 11288, 1377, 112, 188, 1196, 4147, 1103, 4129, 1106, 19770, 2787, 1107, 1772, 1111, 109, 123, 119, 126, 3775, 119, 102, 10684, 2599, 9717, 1161, 3306, 11288, 1377, 112, 188, 1107, 1876, 1111, 109, 5691, 1495, 1550, 1105, 1962, 1122, 1106, 19770, 2787, 1111, 109, 122, 119, 129, 3775, 1107, 1772, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1220, 1125, 1502, 1126, 16355, 1113, 1103, 4639, 1113, 1340, 1275, 117, 4733, 1103, 6527, 1111, 4688, 117, 1119, 1896, 119, 102, 1212, 1340, 1275, 117, 1103, 2062, 112, 188, 5032, 1125, 1502, 1126, 16355, 1113, 1103, 4639, 117, 4733, 1103, 16454, 1111, 4688, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 5596, 5347, 19297, 14748, 1942, 117, 22515, 1830, 6117, 1127, 1146, 1627, 18748, 117, 1137, 125, 119, 125, 110, 117, 1120, 138, 109, 125, 119, 4376, 117, 1515, 2206, 1383, 170, 1647, 1344, 1104, 138, 109, 125, 119, 4667, 119, 102, 22515, 1830, 6117, 4874, 1406, 18748, 117, 1137, 125, 119, 127, 110, 117, 1106, 1383, 170, 1647, 5134, 1344, 1120, 138, 109, 125, 119, 4667, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1109, 4482, 3152, 109, 123, 119, 1429, 117, 1137, 1164, 1429, 3029, 117, 1106, 1601, 5286, 1120, 109, 1626, 119, 4062, 1113, 1103, 1203, 1365, 9924, 7855, 119, 102, 153, 2349, 111, 142, 13619, 119, 6117, 4874, 109, 122, 119, 5519, 1137, 129, 3029, 1106, 109, 1626, 119, 5347, 1113, 1103, 1203, 1365, 9924, 7855, 1113, 5286, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

To apply this function to all the sentences (or pairs of sentences) in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in the dataset, so our training, validation and testing data will be preprocessed in one single command.

tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=["idx", "sentence1", "sentence2"]
)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to map has changed (and thus when the cached data should not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass load_from_cache_file=False in the call to map to ignore the cached files and force the preprocessing to be applied again.
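
For example, here is a sketch of the same call forcing the preprocessing to rerun (load_from_cache_file is an argument of map; the rest mirrors the cell above):

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
    load_from_cache_file=False,  # ignore any cached results and re-apply the preprocessing
)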

Note that we passed batched=True to encode the texts in batches. This lets us leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to process the texts of a batch concurrently.

We already removed the columns our model will not use in the call to map above (via remove_columns). We also need to rename the label column to labels, as this is what our model will expect.

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

To double-check we only have columns that are accepted as arguments for the model we will instantiate, we can look at them here.

tokenized_datasets["train"].features
{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'labels': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
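
As an aside, the ClassLabel feature shown above lets you convert between integer labels and their names (names and int2str are part of the 🤗 Datasets ClassLabel API); a small illustrative sketch:

label_feature = tokenized_datasets["train"].features["labels"]
print(label_feature.names)       # ['not_equivalent', 'equivalent']
print(label_feature.int2str(1))  # 'equivalent'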

The model we will be using is a BertForSequenceClassification. We can check its signature in the Transformers documentation and all seems to be right! The last step is to set our datasets in the "torch" format, so that each item in them is a dictionary with tensor values.

tokenized_datasets.set_format("torch")

A first look at the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the AutoModelForSequenceClassification class. Like with the tokenizer, the from_pretrained method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which is 2 here):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The warning is telling us we are throwing away some weights (the cls.predictions and cls.seq_relationship layers of the pretraining head) and randomly initializing some others (the classifier layer). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.
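
To see what that new head looks like, you can inspect the classifier attribute of the model (for bert-base-cased with num_labels=2 this should show a randomly initialized linear layer mapping the 768-dimensional pooled output to 2 logits):

# The newly initialized classification head is a single linear layer on top of the pooled output.
model.classifier
# Expected repr (weights are random): Linear(in_features=768, out_features=2, bias=True)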

Note that we are only creating the model here to look at it and debug problems. To train on TPU in Colab, we have to put everything related to training inside one big training function that will be executed on each core of the TPU. It's fine to use the datasets defined before (they will be copied to each TPU core), and as explained below, the model is created once outside that function and passed to it, while the dataloaders are re-instantiated inside it.

Now, to feed the data to the model, we need to define our training and evaluation dataloaders. Again, we only create them here for debugging purposes; they will be re-instantiated in our training function, which is why we define a function that builds them.

def create_dataloaders(train_batch_size=8, eval_batch_size=32):
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], shuffle=False, batch_size=eval_batch_size
    )
    return train_dataloader, eval_dataloader

Let's have a look at our train and evaluation dataloaders to check a batch can go through the model.

train_dataloader, eval_dataloader = create_dataloaders()

We just loop through one batch. Since our dataset elements are dictionaries of tensors, the batch is one as well, and we can have a quick look at all the shapes. Note that this cell takes a bit of time to execute since we run a batch of our data through the model on the CPU (if you changed the checkpoint to a bigger model, it might take too much time, so comment it out).

⚠ WARNING: Running this cell will cause training_function to malfunction, as model will be used before notebook_launcher

for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})
    outputs = model(**batch)
    break
{'attention_mask': torch.Size([8, 128]), 'input_ids': torch.Size([8, 128]), 'labels': torch.Size([8]), 'token_type_ids': torch.Size([8, 128])}

The output of our model is a SequenceClassifierOutput, with the loss (since we provided labels) and logits (of shape 8, our batch size, by 2, the number of labels).

outputs
SequenceClassifierOutput([('loss', tensor(0.5652, grad_fn=<NllLossBackward>)), ('logits', tensor([[-0.5161, -0.1519], [-0.5209, -0.1433], [-0.5311, -0.1388], [-0.5313, -0.1394], [-0.5311, -0.1345], [-0.5342, -0.1251], [-0.5255, -0.1308], [-0.5352, -0.1255]], grad_fn=<AddmmBackward>))])

The last piece we will need for the model evaluation is the metric. The datasets library provides a function load_metric that allows us to easily create a datasets.Metric object we can use.

metric = load_metric("glue", "mrpc")

To use this object on some predictions, we call the compute method to get our metric results:

predictions = outputs.logits.detach().argmax(dim=-1)
metric.compute(predictions=predictions, references=batch["labels"])
{'accuracy': 0.875, 'f1': 0.9333333333333333}

Unsurprisingly, our model with its randomly initialized head does not perform well yet (the metrics above are computed on a single batch of eight examples, so they don't mean much), which is why we need to fine-tune it!

Fine-tuning the model

We are now ready to fine-tune this model on our dataset. As mentioned before, everything related to training needs to be in one big training function that will be executed on each TPU core, thanks to our notebook_launcher.

It will use this dictionary of hyperparameters, so tweak anything you like in here!

hyperparameters = {
    "learning_rate": 2e-5,
    "num_epochs": 3,
    "train_batch_size": 8,  # Actual batch size will be this x 8
    "eval_batch_size": 32,  # Actual batch size will be this x 8
    "seed": 42,
}

The two most important things to remember for training on TPUs are that your accelerator object has to be defined inside your training function, and your model should be created outside the training function.

If you define your Accelerator in another cell that gets executed before the final launch (for debugging), you will need to restart your notebook, as the line accelerator = Accelerator() needs to be executed for the first time inside the training function spawned on each TPU core.

This is because that line will look for a TPU device, and if you set it outside of the distributed training launched by notebook_launcher, it will perform setup that cannot be undone in your runtime and you will only have access to one TPU core until you restart the notebook.

The reason we declare the model outside the training function is that, on a TPU launched from a notebook, the same singular model object is used and passed back and forth between all the cores automatically.
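
Schematically, the pattern looks like this (a minimal sketch only; the full training_function below fills in the details):

from accelerate import Accelerator, notebook_launcher

def training_function(model):
    # The Accelerator must be instantiated here, inside the function that runs on each TPU core.
    accelerator = Accelerator()
    # ... build the dataloaders and optimizer, call accelerator.prepare(...), then run the training loop ...
    accelerator.print("Running on", accelerator.device)

# The model itself is created once, outside, and passed in as an argument:
# notebook_launcher(training_function, (model,))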

Since we can't explore each piece in separate cells, comments have been left in the code. This is all pretty standard and you will notice how little the code changes from a regular training loop! The main lines added are:

  • accelerator = Accelerator() to initialize the distributed setup,

  • sending all objects to accelerator.prepare,

  • replacing loss.backward() with accelerator.backward(loss),

  • using accelerator.gather to gather all predictions and labels before storing them in our list of predictions/labels,

  • truncating predictions and labels, as the prepared evaluation dataloader has a few more samples to make batches of the same size on each process.

The first three are for distributed training, the last two for distributed evaluation. If you don't care about distributed evaluation, you can also just replace that part by your standard evaluation loop launched on the main process only.
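
For reference, here is a rough sketch of what such a main-process-only evaluation could look like inside the training function (assuming you do not pass eval_dataloader to accelerator.prepare, so the main process sees the full validation set, and you move each batch to accelerator.device yourself):

if accelerator.is_main_process:
    model.eval()
    all_predictions, all_labels = [], []
    for batch in eval_dataloader:  # un-prepared dataloader: yields the whole validation set
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        all_predictions.append(outputs.logits.argmax(dim=-1).cpu())
        all_labels.append(batch["labels"].cpu())
    eval_metric = metric.compute(
        predictions=torch.cat(all_predictions), references=torch.cat(all_labels)
    )
    print(eval_metric)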

Other changes (which are purely cosmetic to make the output of the training readable) are:

  • putting some logging behavior behind an if accelerator.is_main_process: check,

  • disabling the progress bar if accelerator.is_main_process is False,

  • using accelerator.print instead of print.

def training_function(model):
    # Initialize accelerator
    accelerator = Accelerator()

    # To have only one message (and not 8) per log of Transformers or Datasets, we set the logging verbosity
    # to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    train_dataloader, eval_dataloader = create_dataloaders(
        train_batch_size=hyperparameters["train_batch_size"],
        eval_batch_size=hyperparameters["eval_batch_size"],
    )

    # The seed needs to be set before we instantiate the model, as it will determine the random head.
    set_seed(hyperparameters["seed"])

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave
    # them to the prepare method.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate the learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=len(train_dataloader) * num_epochs,
    )

    # Instantiate a progress bar to keep track of training. Note that we only enable it on the main
    # process to avoid having 8 progress bars.
    progress_bar = tqdm(range(num_epochs * len(train_dataloader)), disable=not accelerator.is_main_process)

    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 8 TPU cores to have them all.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels.
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together, as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(tokenized_datasets["validation"])]
        all_labels = torch.cat(all_labels)[:len(tokenized_datasets["validation"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)

And we're ready for launch! It's super easy with the notebook_launcher from the Accelerate library.

from accelerate import notebook_launcher

notebook_launcher(training_function, (model,))
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307 Model config BertConfig { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "gradient_checkpointing": false, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "transformers_version": "4.5.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 28996 } loading weights file https://huggingface.co/bert-base-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/092cc582560fc3833e556b3f833695c26343cb54b7e88cd02d40821462a74999.1f48cab6c959fc6c360d22bea39d06959e90f5b002e77e836d2da45464875cda Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
epoch 0: {'accuracy': 0.7303921568627451, 'f1': 0.8286604361370716}
epoch 1: {'accuracy': 0.7892156862745098, 'f1': 0.8621794871794871}
epoch 2: {'accuracy': 0.821078431372549, 'f1': 0.8726003490401396}