Transformer Network Application: Named-Entity Recognition
Welcome to Week 4's second ungraded lab. In this notebook you'll explore one application of the transformer architecture that you built in the previous assignment.
After this assignment you'll be able to:
Use tokenizers and pre-trained models from the HuggingFace Library.
Fine-tune a pre-trained transformer model for Named-Entity Recognition.
1 - Named-Entity Recognition to Process Resumes
When faced with a large amount of unstructured text data, named-entity recognition (NER) can help you detect and classify important information in your dataset. For instance, in the running example "Jane visits Africa in September", NER would help you detect "Jane", "Africa", and "September" as named-entities and classify them as person, location, and time.
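As a quick illustration of the task (not the model you will fine-tune in this notebook), an off-the-shelf 🤗 NER pipeline can already tag that sentence. This sketch assumes a recent transformers version (older releases use grouped_entities=True instead of aggregation_strategy), and a general-purpose checkpoint may not label dates such as "September":

```python
from transformers import pipeline

# Illustration only: a pretrained, general-purpose NER pipeline,
# not the DistilBERT model fine-tuned later in this notebook.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Jane visits Africa in September"))
```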
You will use a variation of the Transformer model you built in the last assignment to process a large dataset of resumes.
You will find and classify relevant information such as the companies the applicant worked at, skills, type of degree, etc.
Take a look at your cleaned dataset and the categories the named-entities are matched to, or 'tags'.
Next, you will create an array of tags from your cleaned dataset. Oftentimes your input sequence will exceed the maximum length of a sequence your network can process; in that case the sequence is truncated, and sequences shorter than the maximum need zeros appended onto the end using this Keras padding API (a short sketch follows).
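For example, padding a few hypothetical tag-id sequences to a fixed length of 512 (the same maximum length used by the tokenizer later) might look like this:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical tag-id sequences of unequal length.
tag_ids = [[3, 4, 4, 0], [1, 0]]

# Truncate anything longer than 512 and append zeros to anything shorter.
padded_tags = pad_sequences(tag_ids, maxlen=512, padding='post', truncating='post', value=0)
print(padded_tags.shape)  # (2, 512)
```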
1.3 - Tokenize and Align Labels with 🤗 Library
Before feeding the texts to a Transformer model, you will need to tokenize your input using a 🤗 Transformer tokenizer. It is crucial that the tokenizer you use matches the Transformer model type you are using! In this exercise, you will use the 🤗 DistilBERT fast tokenizer, which standardizes the length of your sequence to 512 and pads with zeros. Notice this matches the maximum length you used when creating tags.
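Loading such a tokenizer is a one-liner. The checkpoint name below is an assumption (the notebook may load its tokenizer from a local directory instead):

```python
from transformers import DistilBertTokenizerFast

# Assumed checkpoint; any DistilBERT checkpoint's fast tokenizer works the same way.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# Pad/truncate to a fixed length of 512, matching the tags created earlier.
encoded = tokenizer("Jane visits Africa in September",
                    truncation=True, padding='max_length', max_length=512)
print(len(encoded['input_ids']))  # 512
```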
Transformer models are often trained with tokenizers that split words into subwords. For instance, the word 'Africa' might get split into multiple subtokens. This can create some misalignment between the list of tags for the dataset and the list of labels generated by the tokenizer, since the tokenizer can split one word into several, or add special tokens. Before processing, it is important that you align the list of tags and the list of labels generated by the selected tokenizer with a tokenize_and_align_labels() function.
Exercise 1 - tokenize_and_align_labels
Implement tokenize_and_align_labels(). The function should perform the following (a sketch follows this list):
The tokenizer cuts sequences that exceed the maximum size allowed by your model with the parameter truncation=True.
Align the list of tags with the labels generated by the tokenizer: the word_ids method returns a list that maps each subtoken to its original word in the sentence and maps special tokens to None.
Set the labels of all the special tokens (None) to -100 to prevent them from affecting the loss function.
Label only the first subtoken of each word, and set the label for the following subtokens to -100.
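Here is a minimal sketch of one way to implement this, following the standard 🤗 token-classification recipe; the exact signature and tokenizer arguments (e.g. is_split_into_words) in the notebook may differ:

```python
def tokenize_and_align_labels(tokenizer, examples, tags):
    # examples: a list of sentences, each already split into words;
    # tags: a matching list of integer tag ids, one per word.
    tokenized_inputs = tokenizer(examples,
                                 is_split_into_words=True,
                                 truncation=True,
                                 padding='max_length',
                                 max_length=512)
    labels = []
    for i, word_tags in enumerate(tags):
        # Maps each subtoken to its original word index, or None for special tokens/padding.
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)                 # special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(word_tags[word_idx])  # first subtoken keeps the word's tag
            else:
                label_ids.append(-100)                 # later subtokens of the same word: ignored
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

The value -100 is the conventional "ignore" label id in 🤗 token-classification models, so those positions do not contribute to the loss.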
Now that you have tokenized inputs, you can create train and test datasets!
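One way to build them, assuming the notebook's TensorFlow setup and hypothetical train_tokenized / test_tokenized outputs from tokenize_and_align_labels(), is a simple tf.data pipeline:

```python
import tensorflow as tf

# Minimal sketch: `tokenized` is assumed to be a dict with input_ids,
# attention_mask, and labels, all padded to the same fixed length (512).
def to_tf_dataset(tokenized, batch_size=16):
    features = {"input_ids": tokenized["input_ids"],
                "attention_mask": tokenized["attention_mask"]}
    return tf.data.Dataset.from_tensor_slices((features, tokenized["labels"])).batch(batch_size)

# train_dataset = to_tf_dataset(train_tokenized)
# test_dataset = to_tf_dataset(test_tokenized)
```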
Collecting seqeval
Downloading seqeval-1.2.2.tar.gz (43 kB)
|████████████████████████████████| 43 kB 10.1 MB/s eta 0:00:01
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from seqeval) (1.18.4)
Requirement already satisfied: scikit-learn>=0.21.3 in /usr/local/lib/python3.6/dist-packages (from seqeval) (0.24.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.21.3->seqeval) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.21.3->seqeval) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.21.3->seqeval) (1.0.1)
Building wheels for collected packages: seqeval
Building wheel for seqeval (setup.py) ... done
Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=17585 sha256=788395b334403c261dc5bb8f2b4bdb7cfddf3a4737a2bb325b41da0f540ca2a6
Stored in directory: /root/.cache/pip/wheels/39/29/36/1c4f7905c133e11748ca375960154964082d4fb03478323089
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
WARNING: You are using pip version 20.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
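The seqeval package installed above is commonly used to score NER predictions at the entity level rather than the token level. A minimal sketch with hypothetical gold and predicted tag sequences (the notebook's own tag set will differ):

```python
from seqeval.metrics import classification_report, f1_score

# One list of IOB tags per sentence; toy values for illustration only.
y_true = [['B-PER', 'O', 'B-LOC', 'O', 'B-DATE']]
y_pred = [['B-PER', 'O', 'B-LOC', 'O', 'O']]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```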
Congratulations!
Here's what you should remember
Named-entity recognition (NER) detects and classifies named-entities, and can help process resumes, customer reviews, browsing histories, etc.
You must preprocess text data with the tokenizer that corresponds to the pretrained model before feeding your input into your Transformer model.