Before we can browse the rest of the notebook, we need to install the dependencies: this example uses `datasets` and `transformers`. To use TPUs on Colab, we also need to install `torch_xla`, and the last line installs `accelerate` from source, since the features we are using are very recent and not yet released.
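A rough sketch of what the install cell might look like (the exact `torch_xla` wheel depends on your Colab runtime, so follow the PyTorch/XLA install instructions for it rather than the placeholder comment below):

```python
# Install the libraries used in this notebook (names are illustrative).
!pip install datasets transformers
# torch_xla is only needed for TPUs on Colab; the right wheel depends on the
# runtime, so install it following the PyTorch/XLA instructions.
# Install accelerate from source to get the most recent features.
!pip install git+https://github.com/huggingface/accelerate
```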
Here are all the imports we will need for this notebook.
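A plausible import cell for what follows (a sketch; the exact imports depend on which cells you run):

```python
import torch
from torch.utils.data import DataLoader

from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    set_seed,
)
from accelerate import Accelerator, notebook_launcher
from tqdm.auto import tqdm
```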
This notebook can run with any model checkpoint on the model hub that has a version with a classification head. Here we select `bert-base-cased`.
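For instance (the variable name is just a convention reused in the sketches below):

```python
model_checkpoint = "bert-base-cased"
```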
The next two sections explain how we load and prepare our data for our model. If you are only interested in seeing how 🤗 Accelerate works, feel free to skip them (but make sure to execute all cells!).
Load the data
To load the dataset, we use the `load_dataset` function from 🤗 Datasets. It will download and cache it (so the download won't happen if we restart the notebook).
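For example, assuming the GLUE MRPC task used in the Accelerate NLP examples (the task name is an assumption of this sketch):

```python
raw_datasets = load_dataset("glue", "mrpc")
```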
The `raw_datasets` object itself is a `DatasetDict`, which contains one key each for the training, validation and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`).
To access an actual element, you need to select a split first, then give an index:
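For example:

```python
raw_datasets["train"][0]
```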
To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.
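A hypothetical helper along those lines (the exact function in the notebook may differ; this minimal version just samples a few rows and displays them as a table):

```python
import random
import pandas as pd
from IPython.display import display

def show_random_elements(dataset, num_examples=4):
    # Pick a few random indices and display the corresponding rows.
    picks = random.sample(range(len(dataset)), num_examples)
    display(pd.DataFrame(dataset[picks]))

show_random_elements(raw_datasets["train"])
```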
Preprocess the data
Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs the model requires.
To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:
- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.
By default (unless you pass `use_fast=False` to the call above) it will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.
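For example:

```python
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
```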
You can directly call this tokenizer on one sentence or a pair of sentences:
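```python
tokenizer("Hello, this is one sentence!", "And this sentence goes with it.")
```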
Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in this tutorial if you're interested.
We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. We also need all of our samples to have the same length (we will train on TPU and they need fixed shapes, so we won't pad to the maximum length of a batch), which is done with `padding="max_length"`. The `max_length` argument is used both for the truncation and padding (short inputs are padded to that length and long inputs are truncated to it).
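A minimal sketch of such a preprocessing function, assuming the MRPC column names `sentence1`/`sentence2` and a maximum length of 128 (both assumptions of this sketch):

```python
def tokenize_function(examples):
    # Fixed-length padding keeps tensor shapes static, which TPUs require.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
```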
This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:
To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function on all the elements of all the splits, so our training, validation and testing data will be preprocessed in one single command.
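For example:

```python
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```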
Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus requires not using the cached data). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.
Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
Lastly, we remove the columns that our model will not use. We also need to rename the `label` column to `labels`, as this is what our model will expect.
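A sketch of those two steps, assuming the MRPC column names from earlier:

```python
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
```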
To double-check we only have columns that are accepted as arguments for the model we will instantiate, we can look at them here.
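```python
tokenized_datasets["train"].column_names
```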
The model we will be using is a `BertForSequenceClassification`. We can check its signature in the Transformers documentation and all seems to be right! The last step is to set our datasets in the `"torch"` format, so that each item in them is a dictionary with tensor values.
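```python
tokenized_datasets.set_format("torch")
```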
A first look at the model
Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which is 2 here):
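```python
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
```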
The warning is telling us we are throwing away some weights (the vocab_transform and vocab_layer_norm layers) and randomly initializing some other (the pre_classifier and classifier layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.
Note that we are only creating the model here to look at it and debug problems. We will create the model we will train inside our training function: to train on TPU in Colab, we have to create a big training function that will be executed on each core of the TPU. It's fine to use the datasets defined before (they will be copied to each TPU core), but the model itself will need to be re-instantiated and placed on each device for it to work.
Now, to get the data, we need to define our training and evaluation dataloaders. Again, we only create them here for debugging purposes; they will be re-instantiated in our training function, which is why we define a function that builds them.
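A minimal sketch of such a builder function (the batch sizes are illustrative defaults, not necessarily the notebook's values):

```python
def create_dataloaders(train_batch_size=8, eval_batch_size=32):
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], shuffle=False, batch_size=eval_batch_size
    )
    return train_dataloader, eval_dataloader
```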
Let's have a look at our train and evaluation dataloaders to check a batch can go through the model.
We just loop through one batch. Since our datasets elements are dictionaries of tensors, it's the same for our batch and we can have a quick look at all the shapes. Note that this cell takes a bit of time to execute since we run a batch of our data through the model on the CPU (if you changed the checkpoint to a bigger model, it might take too much time so comment it out).
⚠️ WARNING: Running this cell will cause `training_function` to malfunction, as `model` will be used before `notebook_launcher`.
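A sketch of that debugging cell (only run it if you accept the warning above; the CPU forward pass is intentionally done on a single batch):

```python
train_dataloader, eval_dataloader = create_dataloaders()

for batch in train_dataloader:
    # Print the shape of every tensor in the batch, then run it through the model.
    print({k: v.shape for k, v in batch.items()})
    outputs = model(**batch)
    break
```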
The output of our model is a `SequenceClassifierOutput`, with the `loss` (since we provided labels) and the `logits` (of shape 8, our batch size, by 2, the number of labels).
The last piece we will need for the model evaluation is the metric. The `datasets` library provides a `load_metric` function that allows us to easily create a `datasets.Metric` object we can use.
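For example, sticking with the GLUE MRPC assumption from earlier:

```python
metric = load_metric("glue", "mrpc")
```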
To use this object on some predictions, we call the `compute` method to get our metric results:
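A sketch reusing `outputs` and `batch` from the debugging cell above:

```python
predictions = outputs.logits.detach().argmax(dim=-1)
metric.compute(predictions=predictions, references=batch["labels"])
```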
Unsurprisingly, our model with its random head does not perform well, which is why we need to fine-tune it!
Fine-tuning the model
We are now ready to fine-tune this model on our dataset. As mentioned before, everything related to training needs to be in one big training function that will be executed on each TPU core, thanks to our `notebook_launcher`.
It will use this dictionary of hyperparameters, so tweak anything you like in here!
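A plausible version of that dictionary (the values here are common defaults, not necessarily the notebook's exact choices):

```python
hyperparameters = {
    "learning_rate": 2e-5,
    "num_epochs": 3,
    "train_batch_size": 8,   # per device
    "eval_batch_size": 32,   # per device
    "seed": 42,
}
```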
The two most important things to remember for training on TPUs are that your accelerator object has to be defined inside your training function, and your model should be created outside the training function.
If you define your Accelerator in another cell that gets executed before the final launch (for debugging), you will need to restart your notebook, as the line `accelerator = Accelerator()` needs to be executed for the first time inside the training function spawned on each TPU core. This is because that line will look for a TPU device, and if you set it outside of the distributed training launched by `notebook_launcher`, it will perform setup that cannot be undone in your runtime and you will only have access to one TPU core until you restart the notebook.
The reason we declare the model outside the loop is that, on a TPU launched from a notebook, the same singular model object is used, and it is passed back and forth between all the cores automatically.
Since we can't explore each piece in separate cells, comments have been left in the code. This is all pretty standard and you will notice how little the code changes from a regular training loop! The main lines added are:
- `accelerator = Accelerator()` to initialize the distributed setup,
- sending all objects to `accelerator.prepare`,
- replacing `loss.backward()` with `accelerator.backward(loss)`,
- using `accelerator.gather` to gather all predictions and labels before storing them in our list of predictions/labels,
- truncating predictions and labels, as the prepared evaluation dataloader has a few more samples to make batches of the same size on each process.
The first three are for distributed training, the last two for distributed evaluation. If you don't care about distributed evaluation, you can also just replace that part by your standard evaluation loop launched on the main process only.
Other changes (which are purely cosmetic to make the output of the training readable) are:
- some logging behavior behind an `if accelerator.is_main_process:` check,
- disabling the progress bar if `accelerator.is_main_process` is `False`,
- using `accelerator.print` instead of `print`.
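Pulling the pieces described above together, here is a hedged sketch of what such a training function might look like. It is not the notebook's exact code: the optimizer, scheduler, warmup steps, and the GLUE MRPC metric are assumptions of this sketch, and it reuses the `create_dataloaders` and `hyperparameters` sketches from earlier.

```python
def training_function(model):
    # The Accelerator must be created *inside* the function so that each TPU
    # core spawned by notebook_launcher sets itself up correctly.
    accelerator = Accelerator()

    # Seed every process identically, then build the dataloaders.
    set_seed(hyperparameters["seed"])
    train_dataloader, eval_dataloader = create_dataloaders(
        train_batch_size=hyperparameters["train_batch_size"],
        eval_batch_size=hyperparameters["eval_batch_size"],
    )
    metric = load_metric("glue", "mrpc")

    optimizer = torch.optim.AdamW(model.parameters(), lr=hyperparameters["learning_rate"])

    # Prepare everything: Accelerate places the objects on the right device
    # and shards the dataloaders across processes.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_epochs = hyperparameters["num_epochs"]
    # The scheduler is created after prepare, so len(train_dataloader) reflects
    # the per-process sharding.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=100,
        num_training_steps=num_epochs * len(train_dataloader),
    )

    # Progress bar only on the main process to keep the output readable.
    progress_bar = tqdm(
        range(num_epochs * len(train_dataloader)),
        disable=not accelerator.is_main_process,
    )

    for epoch in range(num_epochs):
        model.train()
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            # accelerator.backward replaces loss.backward().
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            # Gather predictions and labels from all processes.
            all_predictions.append(accelerator.gather(predictions).cpu())
            all_labels.append(accelerator.gather(batch["labels"]).cpu())

        # Truncate the extra samples added to even out the last batches.
        num_eval = len(tokenized_datasets["validation"])
        all_predictions = torch.cat(all_predictions)[:num_eval]
        all_labels = torch.cat(all_labels)[:num_eval]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)
        # accelerator.print only prints on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)
```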
And we're ready for launch! It's super easy with the `notebook_launcher` from the Accelerate library.
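For example (on a Colab TPU there are typically 8 cores; `num_processes` can also be left for Accelerate to detect):

```python
notebook_launcher(training_function, args=(model,), num_processes=8)
```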