Optimization Methods
Until now, you've always used Gradient Descent to update the parameters and minimize the cost. In this notebook, you'll gain skills with some more advanced optimization methods that can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.
By the end of this notebook, you'll be able to:
Apply optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
Use random minibatches to accelerate convergence and improve optimization
Gradient descent goes "downhill" on a cost function $J$. Think of it as trying to do this:
At each step of the training, you update your parameters following a certain direction to try to get to the lowest possible point.
Notation: As usual, $\frac{\partial J}{\partial a} = da$ for any variable $a$.
Let's get started!
Important Note on Submission to the AutoGrader
Before submitting your assignment to the AutoGrader, please make sure you are not doing the following:
You have not added any extra print statement(s) in the assignment.
You have not added any extra code cell(s) in the assignment.
You have not changed any of the function parameters.
You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.
You are not changing the assignment code where it is not required, like creating extra variables.
If you do any of the above, you will get something like a Grader not found (or similarly unexpected) error upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case, and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these instructions.
2 - Gradient Descent
A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all examples on each step, it is also called Batch Gradient Descent.
Exercise 1 - update_parameters_with_gd
Implement the gradient descent update rule. The gradient descent rule is, for $l = 1, ..., L$:
$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$
$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$
where $L$ is the number of layers and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 1 in the for loop, as the first parameters are $W^{[1]}$ and $b^{[1]}$.
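To make the rule concrete, here is a minimal sketch of what such an update could look like (not necessarily the graded implementation; it assumes the parameters and grads dictionaries use the W1, b1, dW1, db1, ... key convention shown in the expected output below):

```python
def gd_update_sketch(parameters, grads, learning_rate):
    # Batch gradient descent step: W[l] <- W[l] - alpha * dW[l], b[l] <- b[l] - alpha * db[l]
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters
```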
W1 =
[[ 1.63535156 -0.62320365 -0.53718766]
[-1.07799357 0.85639907 -2.29470142]]
b1 =
[[ 1.74604067]
[-0.75184921]]
W2 =
[[ 0.32171798 -0.25467393 1.46902454]
[-2.05617317 -0.31554548 -0.3756023 ]
[ 1.1404819 -1.09976462 -0.1612551 ]]
b2 =
[[-0.88020257]
[ 0.02561572]
[ 0.57539477]]
All test passed
A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent, where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.
(Batch) Gradient Descent:
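A schematic loop for (batch) gradient descent might look like the following; forward_propagation, compute_cost, backward_propagation, update_parameters and the variables X, Y, parameters, num_iterations are assumed to be defined as in this notebook's model code:

```python
# Schematic only: one parameter update per pass over the WHOLE training set.
for i in range(num_iterations):
    a, caches = forward_propagation(X, parameters)        # forward pass on all m examples
    cost = compute_cost(a, Y)                             # cost over the full batch
    grads = backward_propagation(a, caches, parameters)   # gradients from all m examples
    parameters = update_parameters(parameters, grads)     # single update per iteration
```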
Stochastic Gradient Descent:
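A schematic loop for stochastic gradient descent, under the same assumptions, where m is the number of training examples:

```python
# Schematic only: one parameter update per SINGLE training example.
for i in range(num_iterations):
    for j in range(m):
        a, caches = forward_propagation(X[:, j:j+1], parameters)   # forward pass on one example
        cost = compute_cost(a, Y[:, j:j+1])                        # cost on one example
        grads = backward_propagation(a, caches, parameters)        # gradients from one example
        parameters = update_parameters(parameters, grads)          # update after every example
```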
In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will "oscillate" toward the minimum rather than converge smoothly. Here's what that looks like:
"+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence, but each step is a lot faster to compute for SGD than it is for GD, as it uses only one training example (vs. the whole batch for GD).
Note also that implementing SGD requires 3 for-loops in total:
Over the number of iterations
Over the training examples
Over the layers (to update all parameters, from $(W^{[1]}, b^{[1]})$ to $(W^{[L]}, b^{[L]})$)
In practice, you'll often get faster results if you don't use the entire training set, or just one training example, to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.
"+" denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.
3 - Mini-Batch Gradient Descent
Now you'll build some mini-batches from the training set (X, Y).
There are two steps:
Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.
Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini-batch might be smaller, but you don't need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:
Exercise 2 - random_mini_batches
Implement random_mini_batches. The shuffling part has already been coded for you! To help with the partitioning step, you've been provided the following code that selects the indexes for the $1^{st}$ and $2^{nd}$ mini-batches:
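As a hedged sketch of that index selection (shuffled_X comes from the shuffling step, and the same slicing applies to shuffled_Y):

```python
# Columns 0..63 form the first mini-batch, columns 64..127 the second, and so on.
first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
```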
Note that the last mini-batch might end up smaller than mini_batch_size=64. Let $\lfloor s \rfloor$ represent $s$ rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples $m$ is not a multiple of mini_batch_size=64, then there will be $\lfloor \frac{m}{mini\_batch\_size} \rfloor$ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be $m - mini\_batch\_size \times \lfloor \frac{m}{mini\_batch\_size} \rfloor$.
Hint:
Think of a way in which you can use the for-loop variable k to help you increment i and j in multiples of mini_batch_size.
As an example, if you want to increment in multiples of 3, you could do the following:
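For instance (a purely illustrative snippet, not part of the graded code):

```python
n = 3
for k in range(0, 5):
    print(k * n)   # prints 0, 3, 6, 9, 12: indexes that advance in multiples of n
```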
All test passed!
shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069 0.2344157 ]
All tests passed.
What you should remember:
Shuffling and Partitioning are the two steps required to build mini-batches
Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
4 - Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. The 'direction' of the previous gradients is stored in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
Exercise 3 - initialize_velocity
Initialize the velocity. The velocity, $v$, is a python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is, for $l = 1, ..., L$:
v["dW" + str(l)] = (numpy array of zeros with the same shape as parameters["W" + str(l)])
v["db" + str(l)] = (numpy array of zeros with the same shape as parameters["b" + str(l)])
Note that the iterator l starts at 1 in the for loop as the first parameters are v["dW1"] and v["db1"] (that's a "one" on the superscript).
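A hedged sketch of this initialization, assuming numpy and the key convention above (not necessarily the graded implementation):

```python
import numpy as np

def init_velocity_sketch(parameters):
    # The velocity starts at zero for every dW / db, with shapes matching W / b.
    L = len(parameters) // 2
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v
```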
v["dW1"] =
[[0. 0. 0.]
[0. 0. 0.]]
v["db1"] =
[[0.]
[0.]]
v["dW2"] =
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
v["db2"] =
[[0.]
[0.]
[0.]]
All tests passed.
Exercise 4 - update_parameters_with_momentum
Now, implement the parameters update with momentum. The momentum update rule is, for $l = 1, ..., L$:
$$v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) \, dW^{[l]}$$
$$W^{[l]} = W^{[l]} - \alpha \, v_{dW^{[l]}}$$
$$v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) \, db^{[l]}$$
$$b^{[l]} = b^{[l]} - \alpha \, v_{db^{[l]}}$$
where $L$ is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 1 in the for loop, as the first parameters are $W^{[1]}$ and $b^{[1]}$ (that's a "one" on the superscript).
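A hedged sketch of one momentum step, under the same dictionary conventions (not necessarily the graded implementation):

```python
def momentum_update_sketch(parameters, grads, v, beta, learning_rate):
    # Momentum step: v <- beta*v + (1-beta)*grad, then param <- param - alpha*v
    L = len(parameters) // 2
    for l in range(1, L + 1):
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
    return parameters, v
```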
W1 =
[[ 1.62544598 -0.61290114 -0.52907334]
[-1.07347112 0.86450677 -2.30085497]]
b1 =
[[ 1.74493465]
[-0.76027113]]
W2 =
[[ 0.31930698 -0.24990073 1.4627996 ]
[-2.05974396 -0.32173003 -0.38320915]
[ 1.13444069 -1.0998786 -0.1713109 ]]
b2 =
[[-0.87809283]
[ 0.04055394]
[ 0.58207317]]
v["dW1"] =
[[-0.11006192 0.11447237 0.09015907]
[ 0.05024943 0.09008559 -0.06837279]]
v["db1"] =
[[-0.01228902]
[-0.09357694]]
v["dW2"] =
[[-0.02678881 0.05303555 -0.06916608]
[-0.03967535 -0.06871727 -0.08452056]
[-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = v[[0.02344157]
[0.16598022]
[0.07420442]]
All tests passed.
Note that:
The velocity is initialized with zeros. So the algorithm will take a few iterations to "build up" velocity and start to take bigger steps.
If $\beta = 0$, then this just becomes standard gradient descent without momentum.
How do you choose $\beta$?
The larger the momentum $\beta$ is, the smoother the update, because it takes the past gradients into account more. But if $\beta$ is too big, it could also smooth out the updates too much.
Common values for $\beta$ range from 0.8 to 0.999. If you don't feel inclined to tune this, $\beta = 0.9$ is often a reasonable default.
Tuning the optimal $\beta$ for your model might require trying several values to see what works best in terms of reducing the value of the cost function $J$.
What you should remember:
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
You have to tune a momentum hyperparameter $\beta$ and a learning rate $\alpha$.
5 - Adam
Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
How does Adam work?
It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
It updates parameters in a direction based on combining information from "1" and "2".
The update rule is, for $l = 1, ..., L$:
$$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J}}{\partial W^{[l]}}$$
$$v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t}$$
$$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\frac{\partial \mathcal{J}}{\partial W^{[l]}}\right)^2$$
$$s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t}$$
$$W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}$$
(and analogously for $b^{[l]}$, using $db^{[l]}$), where:
t counts the number of steps taken of Adam
L is the number of layers
$\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
$\alpha$ is the learning rate
$\varepsilon$ is a very small number to avoid dividing by zero
As usual, all parameters are stored in the parameters dictionary.
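A hedged sketch of one Adam step, assuming numpy, the v and s dictionaries described above, and typical default hyperparameter values (the function name and defaults are illustrative, not the graded signature):

```python
import numpy as np

def adam_update_sketch(parameters, grads, v, s, t, learning_rate=0.01,
                       beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Moving averages of gradients (v) and squared gradients (s),
    # bias-corrected by 1 - beta^t, then the combined parameter update.
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p in ("W", "b"):
            g = grads["d" + p + str(l)]
            v["d" + p + str(l)] = beta1 * v["d" + p + str(l)] + (1 - beta1) * g
            s["d" + p + str(l)] = beta2 * s["d" + p + str(l)] + (1 - beta2) * np.square(g)
            v_corr = v["d" + p + str(l)] / (1 - beta1 ** t)   # bias-corrected first moment
            s_corr = s["d" + p + str(l)] / (1 - beta2 ** t)   # bias-corrected second moment
            parameters[p + str(l)] = parameters[p + str(l)] - learning_rate * v_corr / (np.sqrt(s_corr) + epsilon)
    return parameters, v, s
```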
v["dW1"] =
[[0. 0. 0.]
[0. 0. 0.]]
v["db1"] =
[[0.]
[0.]]
v["dW2"] =
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
v["db2"] =
[[0.]
[0.]
[0.]]
s["dW1"] =
[[0. 0. 0.]
[0. 0. 0.]]
s["db1"] =
[[0.]
[0.]]
s["dW2"] =
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
s["db2"] =
[[0.]
[0.]
[0.]]
All tests passed.
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-21-fb9df09bb7f1> in <module>
13 print(f"b2 = \n{parameters['b2']}")
14
---> 15 update_parameters_with_adam_test(update_parameters_with_adam)
~/work/release/W2A1/public_tests.py in update_parameters_with_adam_test(target)
257 assert type(parameters[key]) == np.ndarray, f"Wrong type for parameters['{key}']. Expected np.ndarray"
258 assert parameters[key].shape == parametersi[key].shape, f"Wrong shape for parameters['{key}']. The update must keep the dimensions of parameters inputs"
--> 259 assert np.allclose(parameters[key][0], expected_parameters[key]), f"Wrong values. Check you formulas for parameters['{key}']"
260 #print(f"{key}: \n {str(parameters[key])}")
261
AssertionError: Wrong values. Check you formulas for parameters['W1']
Expected values:
You now have three working optimization algorithms (mini-batch gradient descent, Momentum, Adam). Let's implement a model with each of these optimizers and observe the difference.
A 3-layer neural network has already been implemented for you! You'll train it with:
Mini-batch Gradient Descent: it will call your function update_parameters_with_gd()
Mini-batch Momentum: it will call your functions initialize_velocity() and update_parameters_with_momentum()
Mini-batch Adam: it will call your functions initialize_adam() and update_parameters_with_adam()
6.4 - Summary
optimization method | accuracy | cost shape |
Gradient descent | >71% | smooth |
Momentum | >71% | smooth |
Adam | >94% | smoother |
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible.
On the other hand, Adam clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges a lot faster.
Some advantages of Adam include:
Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
Usually works well even with little tuning of hyperparameters (except the learning rate $\alpha$)
References:
Adam paper: https://arxiv.org/pdf/1412.6980.pdf
7 - Learning Rate Decay and Scheduling
Lastly, the learning rate is another hyperparameter that can help you speed up learning.
During the first part of training, your model can get away with taking large steps, but over time, using a fixed value for the learning rate alpha can cause your model to get stuck in a wide oscillation that never quite converges. But if you were to slowly reduce your learning rate alpha over time, you could then take smaller, slower steps that bring you closer to the minimum. This is the idea behind learning rate decay.
Learning rate decay can be achieved by using either adaptive methods or pre-defined learning rate schedules.
Now, you'll apply scheduled learning rate decay to a 3-layer neural network in three different optimizer modes and see how each one differs, as well as the effect of scheduling at different epochs.
This model is essentially the same as the one you used before, except in this one you'll be able to include learning rate decay. It includes two new parameters, decay and decay_rate.
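As a hedged sketch, a per-epoch inverse-time decay consistent with the example output below could be written as follows (the helper name and signature are illustrative):

```python
def update_lr_sketch(learning_rate0, epoch_num, decay_rate):
    # Decay the learning rate at every epoch:
    # alpha = alpha_0 / (1 + decay_rate * epoch_num)
    return learning_rate0 / (1 + decay_rate * epoch_num)
```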
Original learning rate: 0.5
Updated learning rate: 0.16666666666666666
All test passed
Notice that if you set the decay to occur at every iteration, the learning rate goes to zero too quickly - even if you start with a higher learning rate.
Epoch Number | Learning Rate | Cost |
0 | 0.100000 | 0.701091 |
1000 | 0.000100 | 0.661884 |
2000 | 0.000050 | 0.658620 |
3000 | 0.000033 | 0.656765 |
4000 | 0.000025 | 0.655486 |
5000 | 0.000020 | 0.654514 |
When you're training for only a few epochs, this doesn't cause much trouble, but when the number of epochs is large the optimization algorithm will effectively stop updating. One common fix to this issue is to decay the learning rate every few steps. This is called fixed interval scheduling.
7.2 - Fixed Interval Scheduling
You can help prevent the learning rate from speeding to zero too quickly by scheduling the exponential learning rate decay at a fixed time interval, for example every 1000 epochs. You can either number the intervals, or divide the epoch number by the time interval, which is the size of the window with a constant learning rate.
Exercise 8 - schedule_lr_decay
Calculate the new learning rate using exponential weight decay with fixed interval scheduling:
$$\alpha = \frac{\alpha_0}{1 + decayRate \times \lfloor \frac{epochNum}{timeInterval} \rfloor}$$
Instructions: Implement the learning rate scheduling such that it only changes when epochNum is a multiple of timeInterval.
Note: The fraction in the denominator uses the floor operation.
Hint: numpy.floor
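A hedged sketch of such a schedule, consistent with the example output below (the function name, parameter names, and default interval are assumptions, not the graded signature):

```python
import numpy as np

def schedule_lr_decay_sketch(learning_rate0, epoch_num, decay_rate, time_interval=1000):
    # Decay only once per time_interval epochs, using the floored quotient:
    # alpha = alpha_0 / (1 + decay_rate * floor(epoch_num / time_interval))
    return learning_rate0 / (1 + decay_rate * np.floor(epoch_num / time_interval))
```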
Original learning rate: 0.5
Updated learning rate after 10 epochs: 0.5
Updated learning rate after 100 epochs: 0.3846153846153846
All test passed
Expected output
7.4 - Achieving similar performance with different methods
With Mini-batch GD or Mini-batch GD with Momentum, the accuracy is significantly lower than Adam, but when learning rate decay is added on top, either can achieve performance at a speed and accuracy score that's similar to Adam.
In the case of Adam, notice that the learning curve achieves a similar accuracy but faster.
optimization method | accuracy |
Gradient descent | >94.6% |
Momentum | >95.6% |
Adam | 94% |
Congratulations! You've made it to the end of the Optimization methods notebook. Here's a quick recap of everything you're now able to do:
Apply three different optimization methods to your models
Build mini-batches for your training set
Use learning rate decay scheduling to speed up your training
Great work!