Text classification
Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.
This guide will show you how to:
1. Finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
2. Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:
ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BioGpt, BLOOM, CamemBERT, CANINE, ConvBERT, CTRL, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LED, LiLT, LLaMA, Longformer, LUKE, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MVP, Nezha, Nyströmformer, OpenLlama, OpenAI GPT, OPT, Perceiver, PLBart, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, TAPAS, Transformer-XL, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
Before you begin, make sure you have all the necessary libraries installed:
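A typical install cell looks something like this (the exact package list is an assumption; adjust it to your environment):

```python
# Install the core libraries used in this guide
!pip install transformers datasets evaluate tensorflow
```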
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
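In a notebook, this is commonly done with notebook_login (a minimal sketch):

```python
from huggingface_hub import notebook_login

# Opens a prompt for your Hugging Face access token
notebook_login()
```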
Load IMDb dataset
Start by loading the IMDb dataset from the 🤗 Datasets library:
Then take a look at an example:
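For example, a minimal sketch using 🤗 Datasets:

```python
from datasets import load_dataset

# Load the IMDb movie review dataset from the Hub
imdb = load_dataset("imdb")

# Inspect a single example from the test split
imdb["test"][0]
```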
There are two fields in this dataset:
- text: the movie review text.
- label: a value that is either 0 for a negative review or 1 for a positive review.
Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the text field:
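A sketch, assuming the distilbert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```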
Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:
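Something along these lines (the function name is illustrative):

```python
def preprocess_function(examples):
    # Tokenize the review text; truncation=True clips sequences to the model's maximum input length
    return tokenizer(examples["text"], truncation=True)
```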
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:
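For instance:

```python
tokenized_imdb = imdb.map(preprocess_function, batched=True)
```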
Now create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
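A sketch of the collator setup for TensorFlow:

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
```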
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
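For example:

```python
import evaluate

accuracy = evaluate.load("accuracy")
```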
Then create a function that passes your predictions and labels to compute() to calculate the accuracy:
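A minimal sketch of such a function:

```python
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Take the highest-scoring class for each example before computing accuracy
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
```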
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
Train
Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:
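For the two IMDb classes, the mappings could look like this (the label strings are a naming choice):

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
```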
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!
Then you can load DistilBERT with TFAutoModelForSequenceClassification along with the number of expected labels, and the label mappings:
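A sketch, again assuming the distilbert-base-uncased checkpoint:

```python
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
```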
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
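For example, using the test split for validation (the batch size is an assumption):

```python
tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```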
Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
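One way to set this up with the create_optimizer helper (the learning rate, warmup steps, and epoch count are assumptions):

```python
from transformers import create_optimizer

batch_size = 16
num_epochs = 3
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)

# No loss argument: the model falls back to its built-in task loss
model.compile(optimizer=optimizer)
```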
The last two things to set up before you start training are computing the accuracy from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback:
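A sketch:

```python
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
```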
Specify where to push your model and tokenizer in the PushToHubCallback:
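For example (the output directory, which also names the Hub repository, is a placeholder):

```python
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",  # placeholder name for the local directory and Hub repository
    tokenizer=tokenizer,
)
```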
Then bundle your callbacks together:
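For instance:

```python
callbacks = [metric_callback, push_to_hub_callback]
```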
Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:
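A sketch (the epoch count is an assumption):

```python
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
```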
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Grab some text you'd like to run inference on:
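Any string works; the review below is just an illustration:

```python
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
```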
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:
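A sketch, assuming you pushed the model under a placeholder repository id:

```python
from transformers import pipeline

# "your-username/my_awesome_model" is a placeholder; use your own Hub repository id
classifier = pipeline("sentiment-analysis", model="your-username/my_awesome_model")
classifier(text)
```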
You can also manually replicate the results of the pipeline if you'd like:
Tokenize the text and return TensorFlow tensors:
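For example (same placeholder repository id as above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/my_awesome_model")  # placeholder repo id
inputs = tokenizer(text, return_tensors="tf")
```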
Pass your inputs to the model and return the logits:
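A sketch:

```python
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("your-username/my_awesome_model")  # placeholder repo id
logits = model(**inputs).logits
```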
Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:
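For example:

```python
import tensorflow as tf

# Pick the highest-scoring class and map it back to its text label
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
```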