The tutorial you posted here is actually doing it wrong. I double-checked it against Bishop's two standard books and two of my working implementations. I will point out below exactly where.
An important thing to keep in mind is that you are always searching for derivatives of the error function with respect to a unit or weight. The former are the deltas, the latter is what you use to update your weights.
If you want to understand backpropagation, you have to understand the chain rule. It's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia - it's not that hard. But as soon as you understand the derivations, everything falls into place. Promise! :)
∂E/∂W can be decomposed into ∂E/∂o · ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)
We do have them for the output units, since that is where we can calculate the error. (Usually we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs, or cross entropy in the case of logistic outputs.)
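To make that concrete, here is a tiny NumPy sketch of the output-layer deltas for the quadratic error with linear outputs (the example values are my own, not from the tutorial); with cross entropy and logistic outputs the corresponding derivative with respect to the net input takes the same (o_k - t_k) form:

    import numpy as np

    # Hypothetical values: three output units of the network.
    targets = np.array([1.0, 0.0, 1.0])   # t_k
    outputs = np.array([0.8, 0.3, 0.6])   # o_k, activations of the output units

    # Quadratic error E = 0.5 * sum((t_k - o_k)^2) with linear outputs:
    # dE/do_k = -(t_k - o_k) = o_k - t_k  -- these are the output deltas.
    deltas_out = outputs - targets
    print(deltas_out)   # -> [-0.2  0.3 -0.4] (up to floating point)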
The question now is, how do we get the derivatives for the internal units? Well, we know that the output of a unit is the sum of all incoming units weighted by their weights and the application of a transfer function afterwards. So o_k = f(sum(w_kj * o_j, for all j)).
So what we do is differentiate o_k with respect to o_j: delta_j = ∂E/∂o_j = ∂E/∂o_k · ∂o_k/∂o_j = delta_k · ∂o_k/∂o_j (summed over every unit k that j feeds into, if there is more than one). So given delta_k, we can calculate delta_j!
Let's do this. o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj.
For the case of the sigmoidal transfer function this becomes o_k(1 - o_k) * w_kj, since f'(z_k) = f(z_k)(1 - f(z_k)) = o_k(1 - o_k) for the logistic sigmoid. (Here is the spot where this derivation and the tutorial diverge: check very carefully whether the x(1 - x) factor is evaluated on the output o_k or on the raw net input z_k, because mixing the two up gives a wrong gradient.)
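Here is a minimal NumPy sketch of one such backward step for a single sigmoid layer (the variable names and shapes are my own, not the tutorial's). Note that the sigmoid derivative f'(z_k) is evaluated through the unit's output o_k = f(z_k):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical layer sizes: layer j has 4 units, layer k (one layer up) has 3 units.
    rng = np.random.default_rng(0)
    o_j = rng.random(4)          # outputs of layer j
    W_kj = rng.random((3, 4))    # weight w_kj connects unit j to unit k

    # Forward pass: o_k = f(sum_j w_kj * o_j) = f(z_k)
    z_k = W_kj @ o_j
    o_k = sigmoid(z_k)

    # Assume the deltas of layer k (dE/do_k) are already known.
    delta_k = np.array([0.1, -0.3, 0.2])

    # Backward step: delta_j = sum_k delta_k * f'(z_k) * w_kj,
    # with f'(z_k) = o_k * (1 - o_k) for the logistic sigmoid.
    delta_j = W_kj.T @ (delta_k * o_k * (1.0 - o_k))
    print(delta_j)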
I think the "over all patterns" simply means "in every iteration"... take a look at the RPROP paper
For the partial derivative: you've already implemented the normal back-propagation algorithm, which is a method for efficiently calculating the gradient... there you calculate the δ values for the individual neurons; together with the incoming activations these give you the ∂E/∂w values, i.e. the partial derivatives of the global error as a function of the weights.
so instead of scaling the update by the magnitude of these derivatives, you multiply each weight's step size by one of two constants (η+ or η-), depending on whether the sign of the partial derivative has changed since the previous iteration; only the sign of the current derivative decides the direction of the update, as sketched below
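As a sketch, a simplified RPROP-style update (without the weight-backtracking step of the original paper; the η and step-size constants are the standard values from Riedmiller & Braun, everything else is my own naming) looks roughly like this:

    import numpy as np

    ETA_PLUS, ETA_MINUS = 1.2, 0.5      # standard values from the RPROP paper
    STEP_MIN, STEP_MAX = 1e-6, 50.0     # bounds on the per-weight step size

    def rprop_step(W, grad, prev_grad, step):
        """One simplified RPROP update for a weight matrix W.

        grad:      dE/dW summed over all patterns of the current iteration
        prev_grad: dE/dW of the previous iteration
        step:      per-weight step sizes (Delta_ij)
        """
        same_sign = grad * prev_grad     # > 0: sign kept, < 0: sign flipped
        step = np.where(same_sign > 0, np.minimum(step * ETA_PLUS, STEP_MAX), step)
        step = np.where(same_sign < 0, np.maximum(step * ETA_MINUS, STEP_MIN), step)
        # Only the sign of the derivative is used for the direction, never its magnitude.
        return W - np.sign(grad) * step, step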
Best Answer
Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)

    ∂E/∂w[i,j] = delta[j] * o[i]

where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.

These values can then be used in weight updates, e.g.

    w[i,j] -= gamma * o[i] * delta[j]

where gamma is the learning rate.

The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is

    bias[j] -= gamma_bias * 1 * delta[j]

where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
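In code, with the bias treated as a weight coming from a fixed activation of 1, the two update rules above come down to something like this NumPy sketch (layer sizes and variable names are my own):

    import numpy as np

    # Hypothetical layer: i indexes the lower layer (4 units), j the upper layer (3 units).
    o_i = np.array([0.5, 0.1, 0.9, 0.3])     # o[i], outputs of the lower layer
    W = np.zeros((3, 4))                     # W[j, i] corresponds to w[i, j] above
    bias = np.zeros(3)                       # bias[j]
    delta_j = np.array([0.2, -0.1, 0.05])    # deltas of the upper layer
    gamma = 0.1                              # learning rate
    gamma_bias = 0.05                        # often chosen somewhat smaller than gamma

    # w[i,j] -= gamma * o[i] * delta[j] for every connection, done as an outer product:
    W -= gamma * np.outer(delta_j, o_i)

    # bias[j] -= gamma_bias * 1 * delta[j]; the bias "input" is the constant 1:
    bias -= gamma_bias * delta_j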