Emojify!
Welcome to the second assignment of Week 2! You're going to use word vector representations to build an Emojifier.
Have you ever wanted to make your text messages more expressive? Your emojifier app will help you do that. Rather than writing:
"Congratulations on the promotion! Let's get coffee and talk. Love you!"
The emojifier can automatically turn this into:
"Congratulations on the promotion! ๐ Let's get coffee and talk. โ๏ธ Love you! โค๏ธ"
You'll implement a model which inputs a sentence (such as "Let's go see the baseball game tonight!") and finds the most appropriate emoji to be used with this sentence (⚾️).
Using Word Vectors to Improve Emoji Lookups
In many emoji interfaces, you need to remember that ❤️ is the "heart" symbol rather than the "love" symbol.
In other words, you'll have to remember to type "heart" to find the desired emoji, and typing "love" won't bring up that symbol.
You can make a more flexible emoji interface by using word vectors!
When using word vectors, you'll see that even if your training set explicitly relates only a few words to a particular emoji, your algorithm will be able to generalize and associate additional words in the test set to the same emoji.
This works even if those additional words don't even appear in the training set.
This allows you to build an accurate classifier mapping from sentences to emojis, even using a small training set.
What you'll build:
In this exercise, you'll start with a baseline model (Emojifier-V1) using word embeddings.
Then you will build a more sophisticated model (Emojifier-V2) that further incorporates an LSTM.
By the end of this notebook, you'll be able to:
Create an embedding layer in Keras with pre-trained word vectors
Explain the advantages and disadvantages of the GloVe algorithm
Describe how negative sampling learns word vectors more efficiently than other methods
Build a sentiment classifier using word embeddings
Build and train a more sophisticated classifier using an LSTM
Important Note on Submission to the AutoGrader
Before submitting your assignment to the AutoGrader, please make sure of the following:
You have not added any extra `print` statement(s) in the assignment.
You have not added any extra code cell(s) in the assignment.
You have not changed any of the function parameters.
You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from this and use local variables instead.
You are not changing the assignment code where it is not required, such as creating extra variables.
If you do any of the above, you will get something like a `Grader not found` (or similarly unexpected) error upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these instructions.
Table of Contents
1 - Baseline Model: Emojifier-V1
1.1 - Dataset EMOJISET
Let's start by building a simple baseline classifier.
You have a tiny dataset (X, Y) where:
X contains 127 sentences (strings).
Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence.
Load the dataset using the code below. The dataset is split between training (127 examples) and testing (56 examples).
Run the following cell to print sentences from X_train and corresponding labels from Y_train.
Change `idx` to see different examples. Note that due to the font used by the iPython notebook, the heart emoji may be colored black rather than red.
1.2 - Overview of the Emojifier-V1
In this section, you'll implement a baseline model called "Emojifier-v1".
Inputs and Outputs
The input of the model is a string corresponding to a sentence (e.g. "I love you").
The output will be a probability vector of shape (1,5), indicating that there are 5 emojis to choose from.
The (1,5) probability vector is passed to an argmax layer, which extracts the index of the emoji with the highest probability.
One-hot Encoding
To get your labels into a format suitable for training a softmax classifier, convert $Y$ from its current shape $(m, 1)$ into a "one-hot representation" $(m, 5)$, where each row is a one-hot vector giving the label of one example.
Here, `Y_oh` stands for "Y-one-hot" in the variable names `Y_oh_train` and `Y_oh_test`:
Now, see what `convert_to_one_hot()` did. Feel free to change `index` to print out different values.
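For reference, a one-hot conversion like this can be written in a single NumPy indexing step. The sketch below is only an illustration and assumes integer labels in the range 0 to C-1; the helper used in this notebook may be implemented differently.

```python
import numpy as np

def convert_to_one_hot_sketch(Y, C):
    """Map integer labels in Y (shape (m,)) to one-hot rows of length C."""
    return np.eye(C)[Y.reshape(-1)]

# Example: label 3 out of 5 classes becomes [0. 0. 0. 1. 0.]
Y = np.array([0, 3, 4])
print(convert_to_one_hot_sketch(Y, C=5))
```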
All the data is now ready to be fed into the Emojify-V1 model. You're ready to implement the model!
1.3 - Implementing Emojifier-V1
As shown in Figure 2 (above), the first step is to:
Convert each word in the input sentence into its word vector representation.
Take an average of the word vectors.
Similar to this week's previous assignment, you'll use pre-trained 50-dimensional GloVe embeddings.
Run the following cell to load `word_to_vec_map`, which contains all the vector representations.
You've loaded:
`word_to_index`: dictionary mapping from words to their indices in the vocabulary (400,001 words, with valid indices ranging from 0 to 400,000)
`index_to_word`: dictionary mapping from indices to their corresponding words in the vocabulary
`word_to_vec_map`: dictionary mapping words to their GloVe vector representation
Run the following cell to check if it works:
Exercise 1 - sentence_to_avg
Implement sentence_to_avg()
You'll need to carry out two steps:
Convert every sentence to lower-case, then split the sentence into a list of words.
`X.lower()` and `X.split()` might be useful.
For each word in the sentence, access its GloVe representation.
Then take the average of all of these word vectors.
You might use `numpy.zeros()`, which you can read more about here.
Additional Hints
When creating the `avg` array of zeros, you'll want it to be a vector of the same shape as the other word vectors in the `word_to_vec_map`.
You can choose a word that exists in the `word_to_vec_map` and access its `.shape` field.
Be careful not to hard-code the word that you access. In other words, don't assume that if you see the word 'the' in the `word_to_vec_map` within this notebook, that this word will be in the `word_to_vec_map` when the function is being called by the automatic grader.
Hint: you can use any one of the word vectors that you retrieved from the input sentence to find the shape of a word vector.
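Putting the steps and hints together, a minimal sketch might look like the following. It is an illustration rather than the graded solution; it assumes `word_to_vec_map` maps words to NumPy vectors and, as an extra assumption, simply skips words missing from the map.

```python
import numpy as np

def sentence_to_avg_sketch(sentence, word_to_vec_map):
    """Average the GloVe vectors of the words in `sentence` (illustrative sketch)."""
    # Step 1: lower-case the sentence and split it into a list of words
    words = sentence.lower().split()

    # Initialize avg with the same shape as any word vector (no hard-coded word)
    any_word = next(iter(word_to_vec_map.keys()))
    avg = np.zeros(word_to_vec_map[any_word].shape)

    # Step 2: sum the vectors of words found in the map, then divide by the count
    count = 0
    for w in words:
        if w in word_to_vec_map:          # assumption: skip out-of-vocabulary words
            avg += word_to_vec_map[w]
            count += 1
    if count > 0:
        avg = avg / count
    return avg
```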
avg =
[-0.008005 0.56370833 -0.50427333 0.258865 0.55131103 0.03104983
-0.21013718 0.16893933 -0.09590267 0.141784 -0.15708967 0.18525867
0.6495785 0.38371117 0.21102167 0.11301667 0.02613967 0.26037767
0.05820667 -0.01578167 -0.12078833 -0.02471267 0.4128455 0.5152061
0.38756167 -0.898661 -0.535145 0.33501167 0.68806933 -0.2156265
1.797155 0.10476933 -0.36775333 0.750785 0.10282583 0.348925
-0.27262833 0.66768 -0.10706167 -0.283635 0.59580117 0.28747333
-0.3366635 0.23393817 0.34349183 0.178405 0.1166155 -0.076433
0.1445417 0.09808667]
All tests passed!
1.4 - Implement the Model
You now have all the pieces to finish implementing the `model()` function! After using `sentence_to_avg()` you need to:
Pass the average through forward propagation
Compute the cost
Backpropagate to update the softmax parameters
Exercise 2 - model
Implement the `model()` function described in Figure (2).
The equations you need to implement in the forward pass and to compute the cross-entropy cost are:

$$ z^{(i)} = W \cdot avg^{(i)} + b $$
$$ a^{(i)} = softmax(z^{(i)}) $$
$$ \mathcal{L}^{(i)} = - \sum_{k = 0}^{n_y - 1} Y_{oh,k}^{(i)} \log\left(a_k^{(i)}\right) $$

The variable $Y_{oh}$ ("Y one hot") is the one-hot encoding of the output labels.
Note: It is possible to come up with a more efficient vectorized implementation. For now, just use nested for loops to better understand the algorithm, and for easier debugging.
The function `softmax()` is provided, and has already been imported.
Epoch: 0 --- cost = 0.048117668529347436
Accuracy: 0.9166666666666666
Epoch: 100 --- cost = 0.0010109208124973721
Accuracy: 1.0
All tests passed!
Run the next cell to train your model and learn the softmax parameters (W, b). The training process will take about 5 minutes.
Great! Your model has pretty high accuracy on the training set. Now see how it does on the test set:
Note:
Random guessing would have had 20% accuracy, given that there are 5 classes. (1/5 = 20%).
This is pretty good performance after training on only 127 examples.
The Model Matches Emojis to Relevant Words
In the training set, the algorithm saw the sentence
"I love you."
with the label ❤️.
You can check that the word "cherish" does not appear in the training set.
Nonetheless, let's see what happens if you write "I cherish you."
Amazing!
Because cherish has an embedding similar to love, the algorithm has generalized correctly even to a word it has never seen before.
Words such as heart, dear, beloved or adore have embedding vectors similar to love.
Feel free to modify the inputs above and try out a variety of input sentences.
How well does it work?
Word Ordering isn't Considered in this Model
Note that the model doesn't get the following sentence correct:
"not feeling happy"
This algorithm ignores word ordering, so it is not good at understanding phrases like "not happy."
Confusion Matrix
Printing the confusion matrix can also help understand which classes are more difficult for your model.
A confusion matrix shows how often an example whose label is one class ("actual" class) is mislabeled by the algorithm with a different class ("predicted" class).
Print the confusion matrix below:
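The notebook provides its own plotting helper for this. As an alternative minimal sketch, assuming the `Y_test` labels and the `pred` array produced by the prediction cell above, scikit-learn's `confusion_matrix` gives the same counts:

```python
from sklearn.metrics import confusion_matrix

# Rows = actual class (0-4), columns = predicted class (0-4).
# Cast/flatten in case `pred` is a float column vector.
cm = confusion_matrix(Y_test.astype(int), pred.astype(int).ravel())
print(cm)
```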
What you should remember:
Even with a mere 127 training examples, you can get a reasonably good model for Emojifying.
This is due to the generalization power word vectors give you.
Emojify-V1 will perform poorly on sentences such as "This movie is not good and not enjoyable" because it doesn't understand combinations of words.
It just averages all the words' embedding vectors together, without considering the ordering of words.
Not to worry! You will build a better algorithm in the next section!
2 - Emojifier-V2: Using LSTMs in Keras
You're going to build an LSTM model that takes word sequences as input! This model will be able to account for word ordering.
Emojifier-V2 will continue to use pre-trained word embeddings to represent words. You'll feed word embeddings into an LSTM, and the LSTM will learn to predict the most appropriate emoji.
Packages
Run the following cell to load the Keras packages you'll need:
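The exact cell contents may differ slightly in your notebook, but assuming a TensorFlow/Keras backend, the loading step typically looks something like this sketch:

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding

np.random.seed(1)  # for reproducibility of weight initialization
```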
2.2 - Keras and Mini-batching
In this exercise, you want to train Keras using mini-batches. However, most deep learning frameworks require that all sequences in the same mini-batch have the same length.
This is what allows vectorization to work: If you had a 3-word sentence and a 4-word sentence, then the computations needed for them are different (one takes 3 steps of an LSTM, one takes 4 steps) so it's just not possible to do them both at the same time.
Padding Handles Sequences of Varying Length
The common solution to handling sequences of different length is to use padding. Specifically:
Set a maximum sequence length
Pad all sequences to have the same length.
Example of Padding:
Given a maximum sequence length of 20, you could pad every sentence with "0"s so that each input sentence is of length 20.
Thus, the sentence "I love you" would be represented as .
In this example, any sentences longer than 20 words would have to be truncated.
One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.
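As a small illustration (using toy word indices, not the dataset), zero-padding a batch of index lists to a fixed `max_len` can be done like this:

```python
import numpy as np

max_len = 20
sentences_as_indices = [[12, 7, 301], [5, 44, 2, 98]]   # hypothetical word-index lists

padded = np.zeros((len(sentences_as_indices), max_len))
for i, idx_list in enumerate(sentences_as_indices):
    idx_list = idx_list[:max_len]             # truncate anything longer than max_len
    padded[i, :len(idx_list)] = idx_list      # remaining positions stay 0
print(padded)
```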
2.3 - The Embedding Layer
In Keras, the embedding matrix is represented as a "layer."
The embedding matrix maps word indices to embedding vectors.
The word indices are positive integers.
The embedding vectors are dense vectors of fixed size.
A "dense" vector is the opposite of a sparse vector. It means that most of its values are non-zero. As a counter-example, a one-hot encoded vector is not "dense."
The embedding matrix can be derived in two ways:
Training a model to derive the embeddings from scratch.
Using a pretrained embedding.
Using and Updating Pre-trained Embeddings
In this section, you'll create an `Embedding()` layer in Keras.
You will initialize the Embedding layer with GloVe 50-dimensional vectors.
In the code below, you'll observe how Keras allows you to either train or leave this layer fixed.
Because your training set is quite small, you'll leave the GloVe embeddings fixed instead of updating them.
Inputs and Outputs to the Embedding Layer
The `Embedding()` layer's input is an integer matrix of size (batch size, max input length). This input corresponds to sentences converted into lists of indices (integers).
The largest integer (the highest word index) in the input should be no larger than the vocabulary size.
The embedding layer outputs an array of shape (batch size, max input length, dimension of word vectors).
The figure shows the propagation of two example sentences through the embedding layer. Both examples have been zero-padded to a length of `max_len=5`. The word embeddings are 50 units in length. The final dimension of the representation is `(2, max_len, 50)`.
Prepare the Input Sentences
Exercise 3 - sentences_to_indices
Implement sentences_to_indices
This function processes an array of sentences X and returns inputs to the embedding layer:
Convert each training sentence into a list of indices (the indices correspond to each word in the sentence)
Zero-pad all these lists so that their length is the length of the longest sentence.
Additional Hints:
Note that you may have considered using the `enumerate()` function in the for loop, but for the purposes of passing the autograder, please follow the starter code by initializing and incrementing `j` explicitly.
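A minimal sketch of the function (an illustration, not the graded solution), assuming the `word_to_index` dictionary loaded earlier; as an assumption of this sketch, words missing from the vocabulary are simply skipped:

```python
import numpy as np

def sentences_to_indices_sketch(X, word_to_index, max_len):
    """Convert an array of sentence strings into a zero-padded array of word indices."""
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))

    for i in range(m):
        sentence_words = X[i].lower().split()
        j = 0                                    # explicit counter, per the hint above
        for w in sentence_words:
            if w in word_to_index:               # skip words not in the vocabulary
                X_indices[i, j] = word_to_index[w]
                j = j + 1
    return X_indices
```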
[[1. 2. 4. 3.]
[4. 8. 6. 5.]
[5. 3. 7. 0.]]
All tests passed!
Expected value
Run the following cell to check what `sentences_to_indices()` does, and take a look at your results.
Build Embedding Layer
Now you'll build the `Embedding()` layer in Keras, using pre-trained word vectors.
The embedding layer takes as input a list of word indices.
`sentences_to_indices()` creates these word indices.
The embedding layer will return the word embeddings for a sentence.
Exercise 4 - pretrained_embedding_layer
Implement `pretrained_embedding_layer()` with these steps:
Initialize the embedding matrix as a numpy array of zeros.
The embedding matrix has a row for each unique word in the vocabulary.
There is one additional row to handle "unknown" words.
So vocab_size is the number of unique words plus one.
Each row will store the vector representation of one word.
For example, one row may be 50 positions long if using GloVe word vectors.
In the code below, `emb_dim` represents the length of a word embedding.
Fill in each row of the embedding matrix with the vector representation of a word
Each word in `word_to_index` is a string. `word_to_vec_map` is a dictionary where the keys are strings and the values are the word vectors.
Define the Keras embedding layer.
Use Embedding().
The input dimension is equal to the vocabulary length (number of unique words plus one).
The output dimension is equal to the number of positions in a word embedding.
Make this layer's embeddings fixed.
If you were to set `trainable = True`, the optimization algorithm would be allowed to modify the values of the word embeddings. In this case, you don't want the model to modify the word embeddings.
Set the embedding weights to be equal to the embedding matrix.
Note that this part of the code is already completed for you and does not need to be modified.
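If you want a reference for the overall shape of the function, here is a minimal sketch (an illustration, not the graded solution), assuming the `word_to_vec_map` and `word_to_index` dictionaries loaded earlier and a TensorFlow/Keras `Embedding` layer:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

def pretrained_embedding_layer_sketch(word_to_vec_map, word_to_index):
    """Build a non-trainable Keras Embedding layer initialized with GloVe vectors."""
    vocab_size = len(word_to_index) + 1                  # extra row for "unknown" words
    any_word = next(iter(word_to_vec_map.keys()))
    emb_dim = word_to_vec_map[any_word].shape[0]         # 50 for GloVe-50 vectors

    # Fill each row of the embedding matrix with the corresponding GloVe vector
    emb_matrix = np.zeros((vocab_size, emb_dim))
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Non-trainable layer whose weights are set to the GloVe matrix
    embedding_layer = Embedding(input_dim=vocab_size, output_dim=emb_dim, trainable=False)
    embedding_layer.build((None,))                       # build before setting the weights
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
```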
All tests passed!
2.4 - Building the Emojifier-V2
Now you're ready to build the Emojifier-V2 model, in which you feed the embedding layer's output to an LSTM network!
Exercise 5 - Emojify_V2
Implement Emojify_V2()
This function builds a Keras graph of the architecture shown in Figure (3).
The model takes as input an array of sentences of shape (`m`, `max_len`) defined by `input_shape`.
The model outputs a softmax probability vector of shape (`m`, `C = 5`).
You may need to use the following Keras layers:
`Input()`: set the `shape` and `dtype` parameters. The inputs are integers, so you can specify the data type as a string, 'int32'.
`LSTM()`: set the `units` and `return_sequences` parameters.
`Dropout()`: set the `rate` parameter.
`Dense()`: set the `units` parameter. Note that `Dense()` has an `activation` parameter; for the purposes of passing the autograder, please do not set the activation within `Dense()`. Use the separate `Activation()` layer to do so.
`Activation()`: you can pass in the activation of your choice as a lowercase string.
`Model()`: set the `inputs` and `outputs` parameters.
Additional Hints
Remember that these Keras layers return an object, and you will feed in the outputs of the previous layer as the input arguments to that object. The returned object can be created and called in the same line.
The `embedding_layer` that is returned by `pretrained_embedding_layer()` is a layer object that can be called as a function, passing in a single argument (sentence indices).
Here is some sample code in case you're stuck:
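Below is a minimal sketch of the full function, assuming the `pretrained_embedding_layer()` from Exercise 4 and using 128 LSTM units with a dropout rate of 0.5 (consistent with the parameter counts reported after the model summary). Treat it as an illustration rather than the graded solution.

```python
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense, Activation
from tensorflow.keras.models import Model

def Emojify_V2_sketch(input_shape, word_to_vec_map, word_to_index):
    """Embedding -> LSTM -> Dropout -> LSTM -> Dropout -> Dense -> softmax (sketch)."""
    sentence_indices = Input(shape=input_shape, dtype='int32')   # word indices, integers

    # Embedding layer from Exercise 4, pre-trained on GloVe and kept fixed
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)

    X = LSTM(units=128, return_sequences=True)(embeddings)  # return the full sequence of hidden states
    X = Dropout(rate=0.5)(X)
    X = LSTM(units=128, return_sequences=False)(X)           # return only the last hidden state
    X = Dropout(rate=0.5)(X)
    X = Dense(units=5)(X)                                     # activation applied separately below
    X = Activation('softmax')(X)

    return Model(inputs=sentence_indices, outputs=X)
```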
All tests passed!
Run the following cell to create your model and check its summary.
Because all sentences in the dataset are less than 10 words, `max_len = 10` was chosen.
You should see that your architecture uses 20,223,927 parameters, of which 20,000,050 (the word embeddings) are non-trainable, with the remaining 223,877 being trainable.
Because your vocabulary size has 400,001 words (with valid indices from 0 to 400,000), there are 400,001 * 50 = 20,000,050 non-trainable parameters.
Compile the Model
As usual, after creating your model in Keras, you need to compile it and define the loss, optimizer, and metrics you want to use. Compile your model using the `categorical_crossentropy` loss, the `adam` optimizer, and `['accuracy']` metrics:
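That is, something like the following standard Keras call (assuming your model object is named `model`):

```python
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```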
2.5 - Train the Model
It's time to train your model! Your Emojifier-V2 model takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `number of classes`). Thus, you have to convert `X_train` (array of sentences as strings) to `X_train_indices` (array of sentences as lists of word indices), and `Y_train` (labels as indices) to `Y_train_oh` (labels as one-hot vectors).
Fit the Keras model on `X_train_indices` and `Y_train_oh`, using `epochs = 50` and `batch_size = 32`.
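A minimal sketch of the conversion and training step, assuming the `sentences_to_indices()` and `convert_to_one_hot()` functions and the `word_to_index` and `max_len` values defined earlier in this notebook:

```python
# Convert sentences to padded index arrays and labels to one-hot vectors
X_train_indices = sentences_to_indices(X_train, word_to_index, max_len)
Y_train_oh = convert_to_one_hot(Y_train, C=5)

# Fit the compiled Keras model
model.fit(X_train_indices, Y_train_oh, epochs=50, batch_size=32, shuffle=True)
```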
Your model should reach around 90% to 100% accuracy on the training set. The exact accuracy may vary!
Run the following cell to evaluate your model on the test set:
You should get a test accuracy between 80% and 95%. Run the cell below to see the mislabelled examples:
Now you can try it on your own example! Write your own sentence below:
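For example, a sketch of predicting on a custom sentence, assuming `sentences_to_indices()`, `word_to_index`, `max_len`, and the emoji-lookup helper `label_to_emoji()` used earlier in this notebook:

```python
import numpy as np

# Replace the sentence with your own example
x_test = np.array(['not feeling happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, max_len)

pred = model.predict(X_test_indices)                 # shape (1, 5) of class probabilities
print(x_test[0] + ' ' + label_to_emoji(int(np.argmax(pred))))
```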
LSTM Version Accounts for Word Order
The Emojify-V1 model did not label "not feeling happy" correctly, but your implementation of Emojify-V2 got it right!
If it didn't, keep in mind that Keras' results vary slightly from run to run, so this is probably why.
The current model still isn't very robust at understanding negation (such as "not happy").
This is because the training set is small and doesn't have a lot of examples of negation.
If the training set were larger, the LSTM model would be much better than the Emojify-V1 model at understanding more complex sentences.
Congratulations!
You've completed this notebook, and harnessed the power of LSTMs to make your words more emotive! ❤️❤️❤️
By now, you've:
Created an embedding matrix
Observed how negative sampling learns word vectors more efficiently than other methods
Experienced the advantages and disadvantages of the GloVe algorithm
And built a sentiment classifier using word embeddings!
Cool!
What you should remember:
If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly.
Word embeddings allow your model to work on words in the test set that may not even appear in the training set.
Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
An `Embedding()` layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If, however, your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.
`LSTM()` has a flag called `return_sequences` to decide if you would like to return every hidden state or only the last one.
You can use `Dropout()` right after `LSTM()` to regularize your network.
3 - Acknowledgments
Thanks to Alison Darcy and the Woebot team for their advice on the creation of this assignment.
Woebot is a chatbot friend that is ready to speak with you 24/7.
Part of Woebot's technology uses word embeddings to understand the emotions of what you say.
You can chat with Woebot by going to http://woebot.io