Week 2 Quiz - Optimization Algorithms
1. Which notation would you use to denote the 4th layer's activations when the input is the 7th example from the 3rd mini-batch?
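As a reminder of the indexing convention used throughout the course (square brackets for the layer, curly braces for the mini-batch, parentheses for the training example), the activation asked about here is written

$$a^{[4]\{3\}(7)} \qquad \text{(layer } 4,\ \text{mini-batch } 3,\ \text{example } 7\text{)}.$$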
2. Which of these statements about mini-batch gradient descent do you agree with?
You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
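To make the loop structure in this question concrete, here is a minimal sketch of one epoch of mini-batch gradient descent. The helpers `compute_cost_and_grads` and `update_parameters` are hypothetical stand-ins for a model's forward/backward pass and parameter update; the point is that the for-loop over mini-batches stays explicit, while each iteration only touches one mini-batch and is therefore much cheaper than a full-batch iteration.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size, seed=0):
    """Shuffle the training set and split it into mini-batches (columns are examples)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + mini_batch_size], Y_shuf[:, k:k + mini_batch_size])
            for k in range(0, m, mini_batch_size)]

def one_epoch(parameters, X, Y, mini_batch_size, learning_rate,
              compute_cost_and_grads, update_parameters):
    # The loop over mini-batches is explicit: each iteration processes one
    # mini-batch, so a single iteration is cheap, but it still takes
    # m / mini_batch_size iterations to see the whole training set once.
    for X_t, Y_t in random_mini_batches(X, Y, mini_batch_size):
        cost_t, grads = compute_cost_and_grads(parameters, X_t, Y_t)   # J^{t} on this mini-batch only
        parameters = update_parameters(parameters, grads, learning_rate)
    return parameters
```

Setting `mini_batch_size` equal to `m` makes the loop body run exactly once per epoch, which recovers batch gradient descent (the point of question 3); setting it to 1 gives stochastic gradient descent.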
3. Which of the following is true about batch gradient descent?
It has as many mini-batches as examples in the training set.
It is the same as stochastic gradient descent, but we don't use random elements.
It is the same as the mini-batch gradient descent when the mini-batch size is the same as the size of the training set.
4. While using mini-batch gradient descent with a batch size larger than 1 but less than m, the plot of the cost function looks like this:
You notice that the value of the cost $J$ is not always decreasing. Which of the following is the most likely reason for that?
You are not implementing the moving averages correctly. Using moving averages will smooth the graph.
The algorithm is at a local minimum, hence the noisy behavior.
A bad implementation of the backpropagation process; we should use gradient checking to debug our implementation.
In mini-batch gradient descent we calculate $J^{\{t\}}$, the cost on the current mini-batch, thus with each batch we compute the cost over a new set of data.
4-1. Suppose your learning algorithm's cost $J$, plotted as a function of the number of iterations, looks like this: [same image] Which of the following do you agree with?
Whether you're using batch gradient descent or mini-batch gradient descent, this looks acceptable.
Whether you're using batch gradient descent or mini-batch gradient descent, something is wrong.
If you're using mini-batch gradient descent, something is wrong. But if you're using batch gradient descent, this looks acceptable.
If you're using mini-batch gradient descent, this looks acceptable. But if you're using batch gradient descent, something is wrong.
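The oscillation discussed in questions 4 and 4-1 is easy to reproduce. Below is a small, self-contained numpy sketch (toy linear-regression data, made up purely for illustration) that records the cost of the current mini-batch at every iteration and the full-batch cost once per epoch. Each $J^{\{t\}}$ is measured on a different slice of data, so the per-mini-batch curve is noisy, whereas batch gradient descent, which computes the cost on the whole training set, should decrease it on every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1024, 5
X = rng.normal(size=(n, m))                       # toy features, one example per column
w_true = rng.normal(size=(n, 1))
Y = w_true.T @ X + 0.5 * rng.normal(size=(1, m))  # noisy linear targets

def cost(w, Xb, Yb):
    err = w.T @ Xb - Yb
    return float(np.mean(err ** 2) / 2)

w = np.zeros((n, 1))
lr, batch_size = 0.05, 64

minibatch_costs, fullbatch_costs = [], []
for epoch in range(5):
    perm = rng.permutation(m)
    for k in range(0, m, batch_size):
        idx = perm[k:k + batch_size]
        Xt, Yt = X[:, idx], Y[:, idx]
        err = w.T @ Xt - Yt
        grad = Xt @ err.T / Xt.shape[1]           # gradient on this mini-batch only
        w -= lr * grad
        minibatch_costs.append(cost(w, Xt, Yt))   # J^{t}: noisy, not monotonically decreasing
    fullbatch_costs.append(cost(w, X, Y))         # full-batch J, evaluated once per epoch for comparison

print(f"first few J^{{t}} values: {np.round(minibatch_costs[:5], 3)}")
print(f"full-batch J per epoch:  {np.round(fullbatch_costs, 3)}")
```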
5. Suppose the temperatures in Casablanca over the first two days of March are the following:
March 1st:
March 2nd:
Say you use an exponentially weighted average with parameter $\beta$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{\text{corrected}}$ is the value you compute with bias correction, what are these values?
.
.
.
.
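A minimal sketch of the exponentially weighted average with and without bias correction, using made-up temperatures and $\beta = 0.5$ purely for illustration (not the quiz's numbers): the correction divides $v_t$ by $1 - \beta^t$, which matters most in the first few days, when $v_t$ is still biased toward the $v_0 = 0$ initialization.

```python
# Exponentially weighted average with optional bias correction.
# beta and the temperatures below are illustrative assumptions, not the quiz's numbers.
beta = 0.5
temps = [10.0, 25.0]           # hypothetical theta_1, theta_2 in degrees C

v = 0.0                        # v_0 = 0
for t, theta in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta      # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    v_corrected = v / (1 - beta ** t)      # bias correction: divide by (1 - beta^t)
    print(f"day {t}: v = {v:.2f}, v_corrected = {v_corrected:.2f}")
```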
6. Which of the following is true about learning rate decay?
The intuition behind it is that for later epochs our parameters are closer to a minimum, thus it is more convenient to take smaller steps to prevent large oscillations.
It helps to reduce the variance of a model.
The intuition behind it is that for later epochs our parameters are closer to a minimum, thus it is more convenient to take larger steps to accelerate the convergence.
We use it to increase the size of the steps taken in each mini-batch iteration.
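A small sketch of one common decay schedule, $\alpha = \frac{\alpha_0}{1 + \text{decayRate}\cdot\text{epoch}}$, with made-up hyperparameter values: the step size shrinks as training proceeds, so later epochs take smaller steps near the minimum instead of oscillating around it.

```python
# Inverse-time learning rate decay (alpha_0 and decay_rate are illustrative values).
alpha_0 = 0.2
decay_rate = 1.0

for epoch in range(5):
    alpha = alpha_0 / (1 + decay_rate * epoch)   # alpha shrinks as the epoch number grows
    print(f"epoch {epoch}: alpha = {alpha:.4f}")
```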
7. You use an exponentially weighted average on the London temperature dataset, using $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ to track the temperature. The yellow and red lines were computed using two different values of $\beta$. Which of the following are true?
.
...
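A useful rule of thumb when comparing such curves: an exponentially weighted average with parameter $\beta$ averages over roughly the last $\frac{1}{1-\beta}$ days, so a larger $\beta$ gives a smoother curve that reacts more slowly (it is shifted further to the right) when the temperature changes.

$$\text{effective window} \approx \frac{1}{1-\beta}, \qquad \text{e.g. } \beta = 0.9 \Rightarrow \text{about 10 days}, \quad \beta = 0.98 \Rightarrow \text{about 50 days}.$$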
8. Which of the following are true about gradient descent with momentum?
It decreases the learning rate as the number of epochs increases.
Gradient descent with momentum makes use of moving averages.
Increasing the hyperparameter $\beta$ smooths out the process of gradient descent.
It generates faster learning by reducing the oscillation of the gradient descent process.
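A minimal sketch of the momentum update for a single parameter matrix, assuming gradients `dW`, `db` come from backprop on the current mini-batch: the velocities are moving averages of the gradients, and a larger $\beta$ averages over more past gradients, which damps the oscillations and usually allows faster learning.

```python
import numpy as np

def initialize_velocity(W, b):
    """Velocities start at zero and have the same shapes as the parameters."""
    return np.zeros_like(W), np.zeros_like(b)

def momentum_update(W, b, dW, db, vW, vb, beta=0.9, learning_rate=0.01):
    # Moving average of the gradients (the momentum term).
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    # Step in the direction of the smoothed gradient.
    W = W - learning_rate * vW
    b = b - learning_rate * vb
    return W, b, vW, vb
```

Note that, unlike learning rate decay, momentum does not change the learning rate as the number of epochs increases; it only smooths the direction of the steps.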
9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$? (Check all that apply)
Try using gradient descent with momentum.
Try mini-batch gradient descent.
Try using Adam.
Try initializing the weights to zero.
Try better random initialization for the weights.
Add more data to the training set.
Normalize the input data.
10. In very high dimensional spaces it is more likely that the gradient descent process gives us a local minimum than a saddle point of the cost function. True/False?
True
False
11. Which of these is NOT a good learning rate decay scheme? Here, $t$ is the epoch number.
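For reference, the decay schemes discussed in the course all make $\alpha$ shrink as the epoch number $t$ grows, for example

$$\alpha = \frac{1}{1 + \text{decayRate}\cdot t}\,\alpha_0, \qquad \alpha = 0.95^{\,t}\,\alpha_0, \qquad \alpha = \frac{k}{\sqrt{t}}\,\alpha_0,$$

whereas any schedule in which $\alpha$ grows with $t$ (for instance $\alpha = e^{t}\alpha_0$) is not a learning rate decay scheme at all.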
12. Consider this figure:
These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β).
...
13. Which of the following statements about Adam is False?
Adam should be used with batch gradient computation, not with mini-batches.
...
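A minimal sketch of the Adam update for one parameter matrix, using the commonly quoted default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8): it combines a momentum-style moving average of the gradients with an RMSprop-style moving average of their squares, and it is routinely used with mini-batches.

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a parameter matrix W with gradient dW.

    vW, sW are the running first and second moments; t is the step count (1-based).
    """
    vW = beta1 * vW + (1 - beta1) * dW          # momentum-style moving average of dW
    sW = beta2 * sW + (1 - beta2) * (dW ** 2)   # RMSprop-style moving average of dW^2
    vW_corr = vW / (1 - beta1 ** t)             # bias correction for the first moment
    sW_corr = sW / (1 - beta2 ** t)             # bias correction for the second moment
    W = W - learning_rate * vW_corr / (np.sqrt(sW_corr) + epsilon)
    return W, vW, sW
```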