
Week 4 Quiz - Transformers


1. A Transformer Network, like its predecessors RNNs, GRUs and LSTMs, can process information one word at a time. (Sequential architecture).

  • True

  • False

📌 A Transformer Network can ingest entire sentences all at the same time.

1-1. A Transformer Network processes sentences from left to right, one word at a time.

  • True

  • False


2. The major innovation of the transformer architecture is combining the use of LSTMs and RNN sequential processing.

  • True

  • False

📌 The major innovation of the transformer architecture is combining the use of attention-based representations and a CNN (convolutional neural network) style of processing.

2-1. Transformer Network methodology is taken from:

  • RNN and LSTMs

  • Attention Mechanism and RNN style of processing.

  • Attention Mechanism and CNN style of processing.

  • GRUs and LSTMs

📌 The Transformer architecture combines the use of attention-based representations and a CNN (convolutional neural network) style of processing.


3. What are the key inputs to computing the attention value for each word?

image

  • The key inputs to computing the attention value for each word are called the query, key, and value.

  • ...

📌 The key inputs to computing the attention value for each word are called the query, key, and value.


4. Which of the following correctly represents Attention?

  • $Attention(Q,K,V)=softmax(\dfrac{QK^T}{\sqrt{d_k}})V$

  • $Attention(Q,K,V)=softmax(\dfrac{QV^T}{\sqrt{d_k}})K$

  • $Attention(Q,K,V)=min(\dfrac{QK^T}{\sqrt{d_k}})V$

  • $Attention(Q,K,V)=min(\dfrac{QV^T}{\sqrt{d_k}})K$
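
A minimal NumPy sketch of the scaled dot-product attention expression $softmax(\dfrac{QK^T}{\sqrt{d_k}})V$ from the options above; the array shapes and function names are illustrative, not part of the quiz:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k), V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)     # attention weights sum to 1 per query
    return weights @ V                     # (seq_len, d_v)

# Toy example (shapes assumed): 4 words, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```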


5. Are the following statements true regarding Query (Q), Key (K) and Value (V)?

Q = interesting questions about the words in a sentence
K = qualities of words given a Q
V = specific representations of words given a Q

  • True

  • False


6. $Attention(W_i^QQ, W_i^KK, W_i^VV)$

$i$ here represents the computed attention weight matrix associated with the $i^{th}$ “word” in a sentence.

  • True

  • False

📌 $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “head” (sequence).
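
To make the per-head projections concrete, here is a minimal NumPy sketch in which each head $i$ applies its own $W_i^Q$, $W_i^K$, $W_i^V$ before the attention of question 4. The head count, dimensions, and names are assumptions for illustration, and the projections are written as right-multiplications (`Q @ W`) in the usual implementation convention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V, as in question 4
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q[i], W_k[i], W_v[i]: (d_model, d_k) projections for head i, so each head
    # computes Attention(W_i^Q Q, W_i^K K, W_i^V V), written here as Q @ W_q[i], etc.
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, project back

# Toy shapes (assumed): 4 words, d_model = 8, 2 heads of size d_k = 4
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))
W_q = [rng.standard_normal((8, 4)) for _ in range(2)]
W_k = [rng.standard_normal((8, 4)) for _ in range(2)]
W_v = [rng.standard_normal((8, 4)) for _ in range(2)]
W_o = rng.standard_normal((8, 8))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # self-attention, shape (4, 8)
```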


7. The following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).

image

What is NOT necessary for the Decoder’s second block of Multi-Head Attention?

  • K

  • Q

  • All of the above are necessary for the Decoder's second block.

  • V
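
As a hedged illustration of why all three are needed: in the decoder's second Multi-Head Attention block, Q typically comes from the decoder's first (masked self-attention) block, while K and V come from the encoder's output. The names below are illustrative, not a specific library's API:

```python
# Sketch only: `mha` stands for any multi-head attention function,
# e.g. the multi_head_attention sketch from question 6.
def decoder_cross_attention(decoder_hidden, encoder_output, mha):
    # Q from the decoder, K and V from the encoder -- so Q, K, and V are all required.
    return mha(Q=decoder_hidden, K=encoder_output, V=encoder_output)
```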


8. The following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).

image

The output of the decoder block contains a softmax layer followed by a linear layer to predict the next word one word at a time.

  • True

  • False
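
In the standard Transformer decoder, the output passes through a linear layer and then a softmax over the vocabulary. A minimal NumPy sketch of that output head; the vocabulary size, weights, and function name are assumed for illustration:

```python
import numpy as np

def next_word_distribution(decoder_output, W_vocab, b_vocab):
    # decoder_output: (d_model,) hidden state for the current position.
    logits = decoder_output @ W_vocab + b_vocab   # linear layer: (vocab_size,) logits
    e = np.exp(logits - np.max(logits))           # then softmax over the vocabulary
    return e / e.sum()                            # probability of each candidate next word

# Toy example (shapes assumed): d_model = 8, vocabulary of 100 words
rng = np.random.default_rng(2)
probs = next_word_distribution(rng.standard_normal(8),
                               rng.standard_normal((8, 100)),
                               rng.standard_normal(100))   # shape (100,), sums to 1
```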


9. Which of the following statements is true?

  • The transformer network is similar to the attention model in that neither contain positional encoding.

  • The transformer network is similar to the attention model in that both contain positional encoding.

  • The transformer network differs from the attention model in that only the transformer network contains positional encoding.

  • The transformer network differs from the attention model in that only the attention model contains positional encoding.

📌 Positional encoding allows the transformer network to offer an additional benefit over the attention model.


10. Which of these is not a good criterion for a good positional encoding algorithm?

  • It should output a common encoding for each time-step (word's position in a sentence).

  • Distance between any two time-steps should be consistent for all sentence lengths.

  • It must be deterministic.

  • The algorithm should be able to generalize to longer sentences.

📌 Outputting a common encoding for every time-step is not a good criterion for a positional encoding algorithm; each position should receive a unique encoding.


11. Which of the following statements are true about positional encoding? Select all that apply.

  • Positional encoding is used in the transformer network and the attention model.

  • Positional encoding provides extra information to our model.

  • Positional encoding uses a combination of sine and cosine equations.

  • Positional encoding is important because position and word order are essential in sentence construction of any language.
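
For reference, the sine/cosine positional encoding referred to in the options is commonly computed as $PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$. A minimal NumPy sketch, with dimensions and the function name assumed for illustration:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Assumes d_model is even; pos is the word's position, i indexes dimension pairs.
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model // 2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)   # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)    # cosine on odd dimensions
    return pe                       # added to the word embeddings before the encoder

pe = positional_encoding(max_len=50, d_model=16)   # shape (50, 16)
```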