How a regression network is traditionally trained
This network is trained using a data set by adjusting so as to minimize an error function, e.g.,
This objective function is a sum of terms, one for each input/target pair , measuring how close the output is to the target :
This minimization is based on repeated evaluation of the gradient of . This gradient can be efficiently computed using the backpropagation algorithm which uses the chain rule to find the derivatives, as we discuss below.
Often, regularization (also known as weight decay) is included, modifying the objective function to:
where .
Gradient descent
(From Wikipedia) Cool animations at http://www.benfrederickson.com/numerical-optimization/
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
Gradient descent is based on the observation that if the multi-variable function is defined and differentiable in a neighborhood of a point , then decreases fastest if one goes from in the direction of the negative gradient of at , . It follows that, if
for small enough, then . In other words, the term is subtracted from because we want to move against the gradient, namely down toward the minimum. With this observation in mind, one starts with a guess for a local minimum of , and considers the sequence such that
We have
, so hopefully the sequence converges to the desired local minimum. Note that the value of the step size is allowed to change at every iteration.
This process is illustrated in the adjacent picture. Here is assumed to be defined on the plane, and that its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is orthogonal to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function is minimal.
#### Illustration of the gradient descept procedure on a series of iterations down a bowl shaped surfaceThe "Zig-Zagging" nature of the method is also evident below, where the gradient descent method is applied to
Stochastic gradient descent (SGD)
Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization and iterative method for minimizing an objective function that is written as a sum of differentiable functions.
There are a number of challenges in applying the gradient descent rule. To understand what the problem is, let's look back at the quadratic cost . Notice that this cost function has the form In practice, to compute the gradient we need to compute the gradients separately for each training input, and then average them. . Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
Stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient by computing for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient.
To connect this explicitly to learning in neural networks, suppose and denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,
where the sums are over all the training examples in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we have exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
The pseudocode would look like:
Choose an initial vector of parameters and learning rate .
Repeat until an approximate minimum is obtained:
Example: linear regression
As seen previously, the objective function to be minimized is:
And the gradent descent equations can be written in matrix form as:
We'll generate a series of 100 random points aligned more or less along the line with and
We now write an SGD code for this problem. The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, . If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.
The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the coefficients according to a single iteration of gradient descent, using just the training data in mini_batch.
Challenge 14.2
Use SGD to train the single neuron in the previous notebook using a linearly separable set of 100 points, divided by the line
You will need the following auxiliary functions:
A simple network to classify handwritten digits
Most of this section has been taken from M. Nielsen's free on-line book: "Neural Networks and Deep Learning" http://neuralnetworksanddeeplearning.com/
In this section we discuss a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.
The input layer of the network contains neurons encoding the values of the input pixels. Our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains 784=28×28 neurons. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by , and we'll experiment with different values for . The example shown illustrates a small hidden layer, containing just neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output , then that will indicate that the network thinks the digit is a 0 . If the second neuron fires then that will indicate that the network thinks the digit is a 1 . And so on. A little more precisely, we number the output neurons from 0 through 9 , and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6 , then our network will guess that the input digit was a 6 . And so on for the other output neurons.

Network to identify single digits. The output layer has 10 neurons, one for each digit.
The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
In practice, we are going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set.
We'll use the notation to denote a training input. It'll be convenient to regard each training input as a 28×28=784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y=y(x) , where y is a 10 -dimensional vector. For example, if a particular training image, , depicts a 6 , then is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix . It's a matrix such that is the weight for the connection between the neuron in the second layer, and the neuron in the third layer. This ordering of the and indices may seem strange. The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:
There's quite a bit going on in this equation, so let's unpack it piece by piece. is the vector of activations of the second layer of neurons. To obtain we multiply by the weight matrix , and add the vector of biases. We then apply the function sigmoid elementwise to every entry in the vector .
Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent.
Most of the work is done by the line
This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
The activation of the neuron in the layer is related to the activations in the layer by the equation where the sum is over all neurons in the layer. To rewrite this expression in a matrix form we define a weight matrix for each layer, . The entries of the weight matrix are just the weights connecting to the layer of neurons, that is, the entry in the row and column is . Similarly, for each layer we define a bias vector, . You can probably guess how this works - the components of the bias vector are just the values , one component for each neuron in the layer. And finally, we define an activation vector whose components are the activations .
With these notations in mind, these equations can be rewritten in the beautiful and compact vectorized form This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the sigmoid function.
Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the sigmoid function, and self.cost_derivative. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code.
We first load the MNIST data:
After loading the MNIST data, we'll set up a Network with 30 hidden neurons.
Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of =3.0:
Challenge 14.3
Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
Number of hidden layers
Suppose that we want to approximate a set of functions to a given accuracy. How many hidden layers do we need? The answer is: At most two layers, with arbitrary accuracy obtained given enough units per layer. It has been also shown that only one layer is enough to approximate any continuous function. Of course, there is no way to know how many units we would need, and this is not known in general, and this number may grow exponentially with the number of input units.