Transformer Pre-processing
Welcome to Week 4's first ungraded lab. In this notebook you will delve into the pre-processing methods you apply to raw text before passing it to the encoder and decoder blocks of the transformer architecture.
After this assignment you'll be able to:
Create visualizations to gain intuition on positional encodings
Visualize how positional encodings affect word embeddings
1 - Positional Encoding
Here are the positional encoding equations that you implemented in the previous assignment. This encoding uses the following formulas:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
It is a standard practice in natural language processing tasks to convert sentences into tokens before feeding texts into a language model. Each token is then converted into a numerical vector of fixed length called an embedding, which captures the meaning of the words. In the Transformer architecture, a positional encoding vector is added to the embedding to pass positional information throughout the model.
The meaning of these vectors can be difficult to grasp solely by examining the numerical representations, but visualizations can help give some intuition as to the semantic and positional similarity of the words. As you've seen in previous assignments, when embeddings are reduced to two dimensions and plotted, semantically similar words appear closer together, while dissimilar words are plotted farther apart. A similar exercise can be performed with positional encoding vectors - words that are closer in a sentence should appear closer when plotted on a Cartesian plane, and when farther in a sentence, should appear farther on the plane.
In this notebook, you will create a series of visualizations of word embeddings and positional encoding vectors to gain intuition into how positional encodings affect word embeddings and help transport sequential information through the Transformer architecture.
Define the embedding dimension as 100. This value must match the dimensionality of the word embedding. In the "Attention is All You Need" paper, embedding sizes range from 100 to 1024, depending on the task. The authors also use a maximum sequence length ranging from 40 to 512 depending on the task. Define the maximum sequence length to be 100, and the maximum number of words to be 64.
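For reference, here is a minimal sketch of how such a positional encoding matrix might be computed with NumPy. The constant names (EMBEDDING_DIM, MAX_SEQUENCE_LENGTH, MAX_NB_WORDS) follow the values stated above; the exact helper used in the lab may differ slightly in its details.

```python
import numpy as np

# Constants used throughout this lab (values taken from the text above).
EMBEDDING_DIM = 100        # dimensionality of the word embeddings and encodings
MAX_SEQUENCE_LENGTH = 100  # maximum number of positions to encode
MAX_NB_WORDS = 64          # maximum vocabulary size for the tokenizer

def positional_encoding(positions, d):
    """Return a (1, positions, d) matrix of sine/cosine positional encodings."""
    pos = np.arange(positions)[:, np.newaxis]   # (positions, 1)
    k = np.arange(d)[np.newaxis, :]             # (1, d)
    # Columns 2i and 2i+1 share the same angle rate pos / 10000^(2i/d).
    angle_rads = pos / np.power(10000, (2 * (k // 2)) / np.float32(d))
    # Apply sin to even indices and cos to odd indices.
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return angle_rads[np.newaxis, ...]

pos_encoding = positional_encoding(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
print(pos_encoding.shape)  # (1, 100, 100)
```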
You have already created this visualization in this assignment, but let us dive a little deeper. Notice some interesting properties of the matrix - the first is that the norm of each of the vectors is always a constant. No matter what the value of pos is, the norm will always be the same value, which in this case is 7.071068 (each sine-cosine pair contributes sin² + cos² = 1, so the norm is √(d/2) = √50). From this property you can conclude that the dot product of two positional encoding vectors is not affected by the scale of the vector, which has important implications for correlation calculations.
Another interesting property is that the norm of the difference between 2 vectors separated by k positions is also constant. If you keep k constant and change pos, the difference will be of approximately the same value. This property is important because it demonstrates that the difference does not depend on the positions of each encoding, but rather on the relative separation between encodings. Being able to express positional encodings as linear functions of one another can help the model to learn by focusing on the relative positions of words.

This reflection of the difference in the positions of words with vector encodings is difficult to achieve, especially given that the values of the vector encodings must remain small enough so that they do not distort the word embeddings.
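If you want to check these two properties numerically, a quick sketch (assuming the positional_encoding helper and constants from the sketch above) might look like this:

```python
import numpy as np

# Positional encoding matrix from the earlier sketch: one row per position.
pe = positional_encoding(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)[0]  # (100, 100)

# 1) The norm of every positional encoding vector is the same: sqrt(d/2).
norms = np.linalg.norm(pe, axis=1)
print(norms.min(), norms.max())            # both ~7.071068 = sqrt(50)

# 2) The norm of the difference between vectors k positions apart is
#    approximately constant, regardless of the starting position.
k = 2
diff_norms = np.linalg.norm(pe[k:] - pe[:-k], axis=1)
print(diff_norms.min(), diff_norms.max())  # nearly identical values
```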
You have observed some interesting properties about the positional encoding vectors - next, you will create some visualizations to see how these properties affect the relationships between encodings and embeddings!
1.2 - Comparing positional encodings
1.2.1 - Correlation
The positional encoding matrix helps you visualize how each vector is unique for every position. However, it is still not clear how these vectors can represent the relative position of the words in a sentence. To illustrate this, you will calculate the correlation between pairs of vectors at every single position. A successful positional encoder will produce a perfectly symmetric matrix in which maximum values are located on the main diagonal - vectors in similar positions should have the highest correlation. Following the same logic, the correlation values should get smaller as they move away from the main diagonal.
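As an illustration, one way to compute and display such a correlation matrix is sketched below, using NumPy's corrcoef and Matplotlib; the lab's own plotting code may differ, and the pe matrix comes from the earlier sketch.

```python
import matplotlib.pyplot as plt
import numpy as np

corr = np.corrcoef(pe)             # (100, 100) correlation between positions
plt.imshow(corr, cmap='RdBu_r')
plt.title('Correlation between positional encodings')
plt.xlabel('Position'); plt.ylabel('Position')
plt.colorbar()
plt.show()
```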
1.2.2 - Euclidean distance
You can also use the Euclidean distance instead of the correlation to compare the positional encoding vectors. In this case, your visualization will display a matrix in which the main diagonal is 0 and the off-diagonal values increase as they move away from it.
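A similar sketch for the Euclidean-distance version is shown below; SciPy's cdist is used purely as an assumed convenience, and a plain double loop over positions works just as well.

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

dists = cdist(pe, pe)              # (100, 100) pairwise Euclidean distances
plt.imshow(dists, cmap='viridis')
plt.title('Euclidean distance between positional encodings')
plt.xlabel('Position'); plt.ylabel('Position')
plt.colorbar()
plt.show()
```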
Nice work! You can use these visualizations as checks for any positional encodings you create.
2 - Semantic embedding
You have gained insight into the relationship positional encoding vectors have with other vectors at different positions by creating correlation and distance matrices. Similarly, you can gain a stronger intuition as to how positional encodings affect word embeddings by visualizing the sum of these vectors.
2.1 - Load pretrained embedding
To combine a pretrained word embedding with the positional encodings you created, start by loading one of the pretrained embeddings from the GloVe project. You will use the embedding with 100 features.
Note: This embedding is composed of 400,000 words and each word embedding has 100 features.
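A typical way to load such an embedding is sketched below; the filename glove.6B.100d.txt and its location are assumptions about your local setup.

```python
import numpy as np

# Map each word to its 100-dimensional GloVe vector.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))  # ~400,000
```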
Consider the following text that only contains two sentences. Wait a minute - these sentences have no meaning! Instead, the sentences are engineered such that:
Each sentence is composed of sets of words that have some semantic similarity within each group.
In the first sentence similar terms are consecutive, while in the second sentence, the order is random.
First, run the following code cell to apply the tokenization to the raw text. Don't worry too much about what this step does - it will be explained in detail in later ungraded labs. A quick summary (not crucial to understanding the lab):
If you feed it an array of plain text sentences of different lengths, it will produce a matrix with one row for each sentence, each of them represented by an array of size MAX_SEQUENCE_LENGTH.
Each value in this array represents each word of the sentence using its corresponding index in a dictionary (word_index).
Sequences shorter than MAX_SEQUENCE_LENGTH are padded with zeros to create uniform length.
Again, this is explained in detail in later ungraded labs, so don't worry about this too much right now!
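For reference, a sketch of this tokenization step with the Keras preprocessing utilities is shown below. The two sentences are illustrative placeholders rather than the exact text used in the lab, and padding='post' is an assumption.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative placeholder sentences: same words, different order.
texts = ['king queen man woman dog wolf football basketball red green yellow',
         'man queen yellow basketball green dog woman football king red wolf']

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)   # lists of word indices
word_index = tokenizer.word_index                 # word -> index dictionary

# Pad every sentence to the same length with zeros.
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
print(data.shape)   # (2, MAX_SEQUENCE_LENGTH)
```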
To simplify your model, you only need to obtain the embeddings for the words that actually appear in the text you are examining. In this case, you will keep only the 11 words appearing in these sentences. The first vector will be an array of zeros and will encode all the unknown words.

Create an embedding layer using the weights extracted from the pretrained GloVe embeddings.

Transform the tokenized input data into embeddings using the layer you just created. Check the shape of the result to make sure the last dimension of this matrix contains the embeddings of the words in the sentence.
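A sketch of these two steps is shown below, assuming the word_index, data, and embeddings_index objects from the earlier sketches; row 0 of the matrix is left as zeros for unknown and padding tokens.

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

vocab_size = len(word_index) + 1   # +1 for the all-zeros row at index 0
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Frozen embedding layer initialized with the GloVe vectors.
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)

embedding = embedding_layer(data)   # tokenized sentences -> embedded sentences
print(embedding.shape)              # (2, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
```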
Nice! Now you can plot the embedding of each of the sentences. Each plot should display the embeddings of the different words.

Plot the word embeddings of the second sentence. Recall that the second sentence contains the same words as the first sentence, just in a different order. You can see that the order of the words does not affect the vector representations.
Wow, look at the big difference between the plots! Both plots have changed drastically compared to their original counterparts. Notice that in the second image, which corresponds to the sentence in which similar words are not together, very dissimilar words such as red and wolf appear closer together.
Now you can try different relative weights and see how strongly this impacts the vector representations of the words in the sentence.
If you set W1 = 1 and W2 = 10, you can see how the arrangement of the words begins to take on a clockwise or anti-clockwise order depending on the position of the words in the sentence. Under these parameters, the positional encoding vectors have dominated the embedding.

Now try inverting the weights to W1 = 10 and W2 = 1. Observe that under these parameters, the plot resembles the original embedding visualizations and there are only a few changes between the positions of the plotted words.
In the previous Transformer assignment, the word embedding is multiplied by sqrt(EMBEDDING_DIM). In this case, that is equivalent to using W1 = sqrt(EMBEDDING_DIM) = 10 and W2 = 1.
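To experiment with these weights yourself, a sketch of the weighted combination is shown below. The combine_and_plot helper is hypothetical, and the 2-D projection via PCA is an assumption about how the vectors are reduced for plotting; it reuses the embedding and pos_encoding arrays from the earlier sketches.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def combine_and_plot(embedding, pos_encoding, W1, W2, sentence=0, n_words=11):
    """Plot a 2-D projection of W1 * word embeddings + W2 * positional encodings."""
    combined = W1 * np.asarray(embedding[sentence]) + W2 * pos_encoding[0]
    coords = PCA(n_components=2).fit_transform(combined[:n_words])
    plt.scatter(coords[:, 0], coords[:, 1])
    plt.title(f'W1 = {W1}, W2 = {W2}')
    plt.show()

combine_and_plot(embedding, pos_encoding, W1=1, W2=10)                      # encodings dominate
combine_and_plot(embedding, pos_encoding, W1=10, W2=1)                      # embeddings dominate
combine_and_plot(embedding, pos_encoding, W1=np.sqrt(EMBEDDING_DIM), W2=1)  # previous assignment's scaling
```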
Congratulations!
You've completed this notebook, and have a better sense of the inputs of the Transformer network!
By now, you've:
Created positional encoding matrices to visualize the relational properties of the vectors
Plotted embeddings and positional encodings on a Cartesian plane to observe how they affect each other
What you should remember:
Positional encodings can be expressed as linear functions of each other, which allows the model to learn according to the relative positions of words.
Positional encodings can affect the word embeddings, but if the relative weight of the positional encoding is small, the sum will retain the semantic meaning of the words.