
GitHub Repository: leechanwoo-kor/coursera
Path: blob/main/deep-learning-specialization/course-5-sequence-models/Week 3 Quiz - Sequence Models & Attention Mechanism.md

Week 3 Quiz - Sequence Models & Attention Mechanism


1. Consider using this encoder-decoder model for machine translation.

image

This model is a “conditional language model” in the sense that the encoder portion (shown in green) is modeling the probability of the input sentence $x$.

  • True

  • False


2. In beam search, if you increase the beam width B, which of the following would you expect to be true?

  • Beam search will use up less memory.

  • Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).

  • Beam search will run more quickly.

  • Beam search will converge after fewer steps.

📌 As the beam width increases, beam search runs more slowly, uses up more memory, and converges after more steps, but generally finds better solutions.
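To make the beam-width trade-off concrete, here is a minimal, illustrative beam search over a toy table of per-step token probabilities. The table, the token names, and the `beam_search` helper are all invented for this sketch; a real decoder would score tokens with the RNN, conditioned on each prefix.

```python
import math

def beam_search(step_probs, beam_width):
    """Toy beam search over a fixed table of per-step token probabilities.

    step_probs[t] maps each token to its probability at step t; in a real
    decoder these would come from the RNN, conditioned on the prefix.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the B highest-scoring hypotheses: a larger B explores
        # more of the search space (better solutions), but costs more
        # memory and time per step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

steps = [{"a": 0.6, "b": 0.4}, {"a": 0.9, "b": 0.1}]
best_seq, best_score = beam_search(steps, beam_width=2)
```

With `beam_width=1` this degenerates to greedy decoding; widening the beam only ever keeps a superset of the hypotheses, which is why larger $B$ generally finds sequences with higher $P(y \mid x)$.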


3. In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.

  • True

  • False
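The short-output bias comes from summing log-probabilities: every extra token multiplies in a factor less than 1, so longer candidates accumulate a more negative score. A small, hypothetical sketch of length normalization (the probability lists and the exponent `alpha = 0.7` are made-up illustrative values):

```python
import math

def seq_log_prob(token_probs):
    # Sum of per-token log-probabilities; each factor is < 1, so every
    # additional token makes the total more negative.
    return sum(math.log(p) for p in token_probs)

def normalized_score(token_probs, alpha=0.7):
    # Dividing by Ty^alpha lets long and short candidates compete on a
    # (softened) per-token basis instead of raw sequence probability.
    return seq_log_prob(token_probs) / (len(token_probs) ** alpha)

short = [0.4, 0.4]              # 2 mediocre tokens
long_ = [0.6, 0.6, 0.6, 0.6]    # 4 tokens, each more probable
```

Without normalization the raw score prefers the shorter sequence even though its per-token quality is worse; the normalized score prefers the longer one.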


4. Suppose you are building a speech recognition system, which uses an RNN model to map from an audio clip $x$ to a text transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$.

On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y}$ = “I’m building an A Eye system in Silly con Valley.”, whereas a human gives a much superior transcript $y^*$ = “I’m building an AI system in Silicon Valley.”

According to your model,

$P(\hat{y} \mid x) = 7.21 \times 10^{-8}$

$P(y^* \mid x) = 1.09 \times 10^{-7}$

Would you expect increasing the beam width $B$ to help correct this example?

  • No, because $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.

  • No, because $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.

  • Yes, because $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.

  • Yes, because $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.

📌 $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN. Increasing the beam width will generally allow beam search to find better solutions.
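The error-analysis rule in the note above can be sketched as a small decision function (the function name and the `"search"`/`"rnn"` labels are invented for illustration):

```python
def attribute_error(p_human, p_model_output):
    """Beam-search error analysis: compare P(y* | x), the model's score
    for the human transcript, to P(y_hat | x), its score for the
    algorithm's own output."""
    if p_human > p_model_output:
        # The model prefers y*, yet search returned y_hat: the search
        # is at fault, so a larger beam width may help.
        return "search"
    # The model scores y_hat at least as high as y*: the RNN itself is
    # at fault, so work on the model (more data, architecture, etc.).
    return "rnn"

verdict = attribute_error(1.09e-7, 7.21e-8)  # numbers from this example
```

Here $P(y^* \mid x) = 1.09 \times 10^{-7}$ exceeds $P(\hat{y} \mid x) = 7.21 \times 10^{-8}$, so the blame falls on the search.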


5. Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should not focus your attention on improving the search algorithm.

  • True

  • False

📌 $P(y^* \mid x) > P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN, so improving the search algorithm is exactly where your attention should go.


6. Consider the attention model for machine translation.

image

Further, here is the formula for $\alpha^{<t,t'>}$.

image

Which of the following statements about $\alpha^{<t,t'>}$ are true? Check all that apply.

  • $\alpha^{<t,t'>}$ is equal to the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$.

  • We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t'>}$. (Note the indices in the superscripts.)

  • $\sum\limits_{t'} \alpha^{<t,t'>} = 0$. (Note the summation is over $t'$.)

  • $\sum\limits_{t'} \alpha^{<t,t'>} = 1$. (Note the summation is over $t'$.)
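The reason the attention weights sum to 1 over $t'$ is that they are computed as a softmax over the scores $e^{<t,t'>}$ for a fixed output step $t$. A minimal sketch (the score values are arbitrary):

```python
import math

def attention_weights(e_scores):
    """Softmax over the scores e^{<t,t'>} for one output step t.
    Guarantees every alpha^{<t,t'>} lies in (0, 1) and that they
    sum to 1 over t'."""
    m = max(e_scores)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in e_scores]
    total = sum(exps)
    return [v / total for v in exps]

alphas = attention_weights([0.5, 2.0, -1.0])
```

The largest score gets the largest weight, but every input position retains a strictly positive share of attention.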


7. The network learns where to “pay attention” by learning the values $e^{<t,t'>}$, which are computed using a small neural network:

We can't replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network. This is because $s^{<t>}$ depends on $\alpha^{<t,t'>}$, which in turn depends on $e^{<t,t'>}$; so at the time we need to evaluate this network, we haven’t computed $s^{<t>}$ yet.

  • True

  • False


8. Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the greatest advantage when:

  • The input sequence length $T_x$ is small.

  • The input sequence length $T_x$ is large.


9. Under the CTC model, identical repeated characters not separated by the “blank” character ( _ ) are collapsed. What does the following string collapse to?

__c_oo_o_kk___b_ooooo__oo__kkk

  • cook book

  • coookkboooooookkk

  • cokbok

  • cookbook
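The two-step collapsing rule (merge runs of identical characters, then drop the blanks) can be sketched as follows; the helper name is invented for illustration:

```python
from itertools import groupby

def ctc_collapse(s, blank="_"):
    # Step 1: collapse runs of identical characters -- repeats not
    # separated by a blank count only once.
    deduped = "".join(ch for ch, _ in groupby(s))
    # Step 2: remove the blank characters.
    return deduped.replace(blank, "")

result = ctc_collapse("__c_oo_o_kk___b_ooooo__oo__kkk")
```

Note the order matters: blanks between repeats (as in `oo_o`) keep both copies of the character, which is how CTC can emit genuinely doubled letters like the two o's in “cook”.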


10. In trigger word detection, $x^{<t>}$ is:

  • The $t^{th}$ input word, represented as either a one-hot vector or a word embedding.

  • Features of the audio (such as spectrogram features) at time $t$.

  • Whether the trigger word is being said at time $t$.

  • Whether someone has just finished saying the trigger word at time $t$.