Understanding Neural Network Backpropagation

Tags: backpropagation, computer science, machine learning, neural-network

Update: a better formulation of the issue.

I'm trying to understand the backpropagation algorithm with an XOR neural network as an example. For this case there are 2 input neurons + 1 bias, 2 neurons in the hidden layer + 1 bias, and 1 output neuron.

 A   B  A XOR B
 1    1   -1
 1   -1    1
-1    1    1
-1   -1   -1

(Figure omitted; source: wikimedia.org)

I'm using stochastic backpropagation.

After reading a bit more I have found out that the error of the output unit is propagated to the hidden layers… initially this was confusing, because when you get to the input layer of the neural network, then each neuron gets an error adjustment from both of the neurons in the hidden layer. In particular, the way the error is distributed is difficult to grasp at first.

Step 1: Calculate the output for each instance of input.
Step 2: Calculate the error between the output neuron(s) (in our case there is only one) and the target value(s):

Step 3: Use the error from Step 2 to calculate the error for each hidden unit h:

The 'weight kh' is the weight between the hidden unit h and the output unit k. This is confusing, because an input unit has no direct weight connecting it to the output unit. After staring at the formula for a few hours I started to think about what the summation means, and I'm coming to the conclusion that each input neuron's weight connecting to a hidden-layer neuron is multiplied by the output error and summed up. That seems logical, but the formula is still confusing, since it clearly says 'weight kh' (between output unit k and hidden unit h).

Am I understanding everything correctly here? Can anybody confirm this?

What's O(h) of the input layer? My understanding is that each input node has two outputs: one that goes into the first node of the hidden layer and one that goes into the second node of the hidden layer. Which of the two outputs should be plugged into the O(h)*(1 - O(h)) part of the formula?

Best Answer

The tutorial you posted here is actually doing it wrong. I double checked it against Bishop's two standard books and two of my working implementations. I will point out below where exactly.

An important thing to keep in mind is that you are always searching for derivatives of the error function with respect to a unit or weight. The former are the deltas, the latter is what you use to update your weights.

If you want to understand backpropagation, you have to understand the chain rule. It's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia; it's not that hard. As soon as you understand the derivations, everything falls into place. Promise! :)

∂E/∂W can be decomposed into ∂E/∂o ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)

We do have the deltas for the output units, since that is where we can calculate the error. (Mostly we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs and cross entropy in the case of logistic outputs.)
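As a small worked case, here is the quadratic error differentiated with respect to one output. This is only a sketch of the standard textbook form; the factor 1/2 is a common convention assumed here, not something stated above.

```latex
E = \frac{1}{2} \sum_k (t_k - o_k)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial o_k} = -(t_k - o_k) = o_k - t_k
```

So the output delta is (t_k - o_k) up to sign; the sign decides whether the update is written as adding or subtracting the gradient term.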

The question now is, how do we get the derivatives for the internal units? Well, we know that the output of a unit is the transfer function applied to the weighted sum of the outputs of all incoming units. So o_k = f(sum(w_kj * o_j, for all j)).

So what we do is differentiate o_k with respect to o_j, since delta_j = ∂E/∂o_j = ∂E/∂o_k ∂o_k/∂o_j = delta_k ∂o_k/∂o_j. (If unit j feeds several units k, this expression is summed over all of those k.) So given delta_k, we can calculate delta_j!

Let's do this. o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj.

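For the sigmoidal (logistic) transfer function, f'(z_k) = f(z_k)(1 - f(z_k)), where z_k = sum(w_kj * o_j, for all j) is the weighted input of unit k. Since f(z_k) = o_k, the factor becomes o_k(1 - o_k) * w_kj. (This is the step to check the tutorial against: its o_k(1 - o_k) * w_kj is right only if o_k there denotes the sigmoid output of the unit and not its raw input sum; z_k(1 - z_k) * w_kj would be wrong.)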
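One way to settle which expression for the sigmoid derivative is right is a finite-difference check. The sketch below assumes a single logistic output unit fed by two hidden units and a quadratic error; all function names and numeric values are made up for illustration. It uses the identity f'(z) = f(z)(1 - f(z)), i.e. o_k(1 - o_k) with o_k the unit's output, and compares the resulting delta_j with a numerical derivative of E with respect to each o_j.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_unit(o_hidden, w_out, b_out):
    """o_k = f(z_k), with z_k = sum_j w_kj * o_j + bias."""
    z_k = sum(w * o for w, o in zip(w_out, o_hidden)) + b_out
    return sigmoid(z_k)

def quadratic_error(o_k, t_k):
    return 0.5 * (t_k - o_k) ** 2

def hidden_deltas(o_hidden, w_out, b_out, t_k):
    """delta_j = dE/do_j = delta_k * f'(z_k) * w_kj for each hidden unit j."""
    o_k = output_unit(o_hidden, w_out, b_out)
    delta_k = o_k - t_k           # dE/do_k for the quadratic error
    f_prime = o_k * (1.0 - o_k)   # sigmoid derivative f'(z_k), written via o_k = f(z_k)
    return [delta_k * f_prime * w_kj for w_kj in w_out]

def numerical_hidden_deltas(o_hidden, w_out, b_out, t_k, eps=1e-6):
    """Central finite differences of E with respect to each o_j."""
    grads = []
    for j in range(len(o_hidden)):
        plus = list(o_hidden)
        minus = list(o_hidden)
        plus[j] += eps
        minus[j] -= eps
        e_plus = quadratic_error(output_unit(plus, w_out, b_out), t_k)
        e_minus = quadratic_error(output_unit(minus, w_out, b_out), t_k)
        grads.append((e_plus - e_minus) / (2 * eps))
    return grads

# Made-up values, purely for illustration
o_hidden = [0.3, 0.8]
w_out = [0.5, -1.2]
b_out = 0.1
t_k = 1.0

analytic = hidden_deltas(o_hidden, w_out, b_out, t_k)
numeric = numerical_hidden_deltas(o_hidden, w_out, b_out, t_k)
print(analytic, numeric)
```

If the two lists agree to several decimal places, the analytic expression is consistent with the definition of the derivative; replacing f_prime with an expression in the raw sum z_k would break that agreement.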

Related Solutions

How to update the bias in neural network backpropagation

Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)

∂E/∂w[i,j] = delta[j] * o[i]

where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.

These values can then be used in weight updates, e.g.

// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]

where gamma is the learning rate.

The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is

bias[j] -= gamma_bias * 1 * delta[j]

where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
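The two update rules can be sketched together as follows. This is a minimal illustration, assuming the deltas have already been computed elsewhere; the function name sgd_step and all numbers are hypothetical, and w[i][j] follows the indexing above (i in the lower layer, j in the upper layer).

```python
def sgd_step(w, bias, o_prev, delta, gamma=0.1, gamma_bias=None):
    """One online gradient-descent update for a single layer, in place.

    w[i][j] -- weight from unit i (lower layer) to unit j (upper layer)
    bias[j] -- bias weight of unit j
    o_prev  -- outputs o[i] of the lower layer
    delta   -- delta[j] of the upper layer (assumed to be given)
    """
    if gamma_bias is None:
        gamma_bias = gamma  # assumption: reuse the weight learning rate by default
    for i, o_i in enumerate(o_prev):
        for j, d_j in enumerate(delta):
            w[i][j] -= gamma * o_i * d_j
    for j, d_j in enumerate(delta):
        # Same rule as for weights, with a constant input of 1
        bias[j] -= gamma_bias * 1.0 * d_j
    return w, bias

# Made-up values, purely for illustration
w = [[0.5, -0.5], [0.25, 0.75]]
bias = [0.0, 0.0]
o_prev = [1.0, -1.0]
delta = [0.2, -0.4]
sgd_step(w, bias, o_prev, delta, gamma=0.5)
print(w, bias)
```

Treating the bias as a weight from an always-on unit means the bias needs no special case in the derivation, only in the bookkeeping.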

Neural network backpropagation with RELU

Regarding "if x <= 0, output is 0. if x > 0, output is 1":

The ReLU function is defined as f(x) = max(0,x): for x > 0 the output is x, and for x <= 0 the output is 0.

So for the derivative f '(x) it's actually:

if x < 0, output is 0. if x > 0, output is 1.

The derivative f '(0) is not defined. So it's usually set to 0 or you modify the activation function to be f(x) = max(e,x) for a small e.
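In code, the function and its derivative might look like this. It is a minimal sketch that picks the convention f'(0) = 0 mentioned above; the function names are illustrative.

```python
def relu(x):
    """Rectifier: f(x) = max(0, x)."""
    return max(0.0, x)

def relu_derivative(x):
    """Derivative of the rectifier; the undefined point x = 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_derivative(x))
```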

Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, but instead of tanh(x), sigmoid(x) or whatever activation you use, you'll use f(x) = max(0,x).

If you have written code for a working multilayer network with sigmoid activation it's literally 1 line of change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
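To illustrate how small the change is, the sketch below passes the activation and its derivative into a single-unit gradient routine as parameters, so switching from sigmoid to ReLU is a change at the call site only. The routine unit_gradient and its arguments are hypothetical, not taken from any particular library, and the upstream value dE/do is assumed to be given.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0  # f'(0) set to 0 by convention

def unit_gradient(inputs, weights, bias, upstream, f, f_prime):
    """Return the unit's output and dE/dw for each weight, given upstream = dE/do."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    scale = upstream * f_prime(z)  # dE/dz by the chain rule
    return f(z), [scale * x for x in inputs]

# The only difference between the two calls is the activation pair passed in
out_sig, grad_sig = unit_gradient([1.0, 2.0], [0.0, 0.0], 0.0, 2.0, sigmoid, sigmoid_prime)
out_relu, grad_relu = unit_gradient([1.0, 2.0], [0.5, 0.25], 0.0, 2.0, relu, relu_prime)
print(out_sig, grad_sig, out_relu, grad_relu)
```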
