Outline
There are several ways to calculate BP for a multi-layer perceptron (MLP). First, notice that we want to calculate $\frac{\partial \ell}{\partial W^{(l)}}$, the gradient of the loss with respect to each weight matrix. There are three routes:

- Calculate $\frac{\partial \ell}{\partial W^{(l)}}$ directly with matrix-to-matrix gradients. (not adopted here, because matrix-to-matrix gradients are not easy to calculate)
- Calculate $\frac{\partial \ell}{\partial W^{(l)}}$ directly while avoiding vector-to-matrix gradients. (not adopted here, because it is still hard)
- Calculate the element-wise gradients $\frac{\partial \ell}{\partial w_{ij}^{(l)}}$ and then assemble them into a matrix. (adopted here)

For our adopted method, although the result of each step is only a scalar gradient, assembling all of them recovers exactly the matrix gradient $\frac{\partial \ell}{\partial W^{(l)}}$ that we want.
We will first derive the gradient for one example. Since the SGD algorithm works on a batch of examples, we then extend the result to a batch.
Preliminaries
To make the calculation easier, we first introduce three points you should be aware of: the denominator layout, the multivariable chain rule, and the assembly of a vector or matrix.
First, in the deep learning community, derivatives use the denominator layout by default, which means a scalar-to-matrix gradient results in a matrix with the same shape as the original matrix: $\left[\frac{\partial \ell}{\partial W}\right]_{ij} = \frac{\partial \ell}{\partial w_{ij}}$. If the result were the transpose of the original matrix, it would be the numerator layout. You can learn more about the denominator layout from the Wikipedia article on matrix calculus.
Next, we should know the multivariable chain rule. If $\ell$ depends on $z_1, \dots, z_n$ and each $z_i$ depends on $t$, then $\frac{\partial \ell}{\partial t} = \sum_{i=1}^{n} \frac{\partial \ell}{\partial z_i} \frac{\partial z_i}{\partial t}$.
Finally, we need to know how to assemble a vector or matrix. For a matrix $W$, once we know every element-wise gradient $\frac{\partial \ell}{\partial w_{ij}}$, we simply place each of them at position $(i, j)$ to obtain the full gradient $\frac{\partial \ell}{\partial W}$; the same applies to vectors.
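To make these three points concrete, here is a minimal NumPy sketch; the toy loss $\ell(W) = \sum_i (W\mathbf{x})_i^2$ is an illustrative assumption, not part of the derivation. It estimates every element-wise gradient by finite differences and assembles them into a matrix with the same shape as $W$, which is exactly the denominator-layout convention.

```python
import numpy as np

# A toy scalar loss of a matrix W (an illustrative assumption): loss(W) = sum((W x)^2).
x = np.array([1.0, 2.0, 3.0])
W = np.random.randn(2, 3)
loss = lambda W: np.sum((W @ x) ** 2)

# Element-wise finite-difference gradients, assembled into a matrix of the same
# shape as W -- the denominator-layout convention.
eps = 1e-5
grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

# Analytic gradient of this toy loss: d loss / dW = 2 (W x) x^T, same shape as W.
print(np.allclose(grad, 2 * np.outer(W @ x, x), atol=1e-5))  # True
```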
Definition
scalar: lowercase italic, e.g. $a$
vector: bold lowercase, e.g. $\mathbf{a}$ (a column vector by default)
matrix: uppercase, e.g. $W$
Subscript: $w_{ij}$ is the element of $W$ at row $i$, column $j$
Input: $\mathbf{x} \in \mathbb{R}^{d}$
Label: $\mathbf{y} \in \mathbb{R}^{K}$, one-hot
Number of layers: $L$
Number of classes: $K$
Linear transformation: $\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)}$
Weight at layer $l$: $W^{(l)}$
Shape of $W^{(l)}$:
- First layer: $[d_1, d]$
- Hidden layers (from 2 to $L-1$): $[d_l, d_{l-1}]$
- Last layer: $[K, d_{L-1}]$
Activation function: $f$, applied element-wise, with derivative $f'$
Activation at layer $l$:
- Input: $\mathbf{a}^{(0)} = \mathbf{x}$
- Activations (from 1 to $L-1$): $\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$
- Logits (last layer output): $\mathbf{z}^{(L)}$, with prediction $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}^{(L)})$
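To make the shapes above concrete, here is a tiny hypothetical example; the dimensions ($d = 4$, hidden sizes $8$, $K = 3$) are assumptions for illustration only.

```python
import numpy as np

d, d1, d2, K = 4, 8, 8, 3         # assumed sizes, for illustration only
W = [
    np.random.randn(d1, d),       # first layer:  [d_1, d]
    np.random.randn(d2, d1),      # hidden layer: [d_l, d_{l-1}]
    np.random.randn(K, d2),       # last layer:   [K, d_{L-1}]
]
L = len(W)                        # number of layers, here L = 3
```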
Forward Pass
In the forward pass, for $l = 1, \dots, L-1$,

$$\mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)}, \qquad \mathbf{a}^{(l)} = f\left(\mathbf{z}^{(l)}\right),$$

and for the last layer

$$\mathbf{z}^{(L)} = W^{(L)} \mathbf{a}^{(L-1)}, \qquad \hat{\mathbf{y}} = \mathrm{softmax}\left(\mathbf{z}^{(L)}\right), \qquad \ell = \mathrm{CE}\left(\mathbf{y}, \hat{\mathbf{y}}\right),$$

where the cross entropy loss is

$$\mathrm{CE}\left(\mathbf{y}, \hat{\mathbf{y}}\right) = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$

We want to calculate the gradient of $\ell$ with respect to every weight matrix, i.e. $\frac{\partial \ell}{\partial W^{(l)}}$ for $l = 1, \dots, L$.
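Before deriving the gradients, here is a minimal single-example sketch of this forward pass under the column-vector convention; the `tanh` activation and the sizes are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                        # shift for numerical stability
    return np.exp(z) / np.sum(np.exp(z))

f = np.tanh                                  # assumed activation, for illustration

d, d1, K = 4, 8, 3
W = [np.random.randn(d1, d), np.random.randn(K, d1)]   # a 2-layer MLP
x = np.random.randn(d)                       # one example (column convention)
y = np.eye(K)[1]                             # one-hot label

# Forward pass: z^(l) = W^(l) a^(l-1), a^(l) = f(z^(l)), y_hat = softmax(z^(L))
a = x
for l in range(len(W) - 1):
    z = W[l] @ a
    a = f(z)
y_hat = softmax(W[-1] @ a)
loss = -np.sum(y * np.log(y_hat))            # cross entropy for one example
print(loss)
```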
The last layer
Since the last layer is different from the other layers, we calculate the gradient for it separately.
In the forward pass, the last layer computes

$$\mathbf{z}^{(L)} = W^{(L)} \mathbf{a}^{(L-1)}, \qquad \hat{y}_k = \frac{\exp\left(z_k^{(L)}\right)}{\sum_{j=1}^{K} \exp\left(z_j^{(L)}\right)}, \qquad \ell = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$

To calculate the gradients $\frac{\partial \ell}{\partial z_i^{(L)}}$, we first need the derivative of the softmax itself,

$$\frac{\partial \hat{y}_k}{\partial z_i^{(L)}} = \hat{y}_k \left(\mathbb{1}[k = i] - \hat{y}_i\right),$$

and then apply the multivariable chain rule:

$$\frac{\partial \ell}{\partial z_i^{(L)}} = -\sum_{k=1}^{K} \frac{y_k}{\hat{y}_k} \frac{\partial \hat{y}_k}{\partial z_i^{(L)}} = -\sum_{k=1}^{K} y_k \left(\mathbb{1}[k = i] - \hat{y}_i\right) = \hat{y}_i \sum_{k=1}^{K} y_k - y_i.$$

Thus, because the one-hot label satisfies $\sum_k y_k = 1$, we have

$$\frac{\partial \ell}{\partial z_i^{(L)}} = \hat{y}_i - y_i.$$

Then assemble the element-wise results into a vector:

$$\frac{\partial \ell}{\partial \mathbf{z}^{(L)}} = \hat{\mathbf{y}} - \mathbf{y}.$$
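A quick numerical check of this result, with assumed random logits and a one-hot label:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

K = 5
z = np.random.randn(K)                        # logits z^(L)
y = np.eye(K)[2]                              # one-hot label

# Central finite differences of the loss w.r.t. each logit.
eps = 1e-5
grad_fd = np.array([
    (loss(z + eps * e, y) - loss(z - eps * e, y)) / (2 * eps)
    for e in np.eye(K)
])

print(np.allclose(grad_fd, softmax(z) - y, atol=1e-6))   # True
```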
Gradients for weights
We can define the error at layer $l$ as

$$\boldsymbol{\delta}^{(l)} = \frac{\partial \ell}{\partial \mathbf{z}^{(l)}}.$$

If we can calculate $\boldsymbol{\delta}^{(l)}$ for every layer, the weight gradients follow immediately, as we show next.
Now let's recap the matrix multiplication in $\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)}$. Element-wise, $z_i^{(l)} = \sum_j w_{ij}^{(l)} a_j^{(l-1)}$, so $w_{ij}^{(l)}$ only affects $z_i^{(l)}$ and $\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} = a_j^{(l-1)}$. By the chain rule,

$$\frac{\partial \ell}{\partial w_{ij}^{(l)}} = \frac{\partial \ell}{\partial z_i^{(l)}} \frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}.$$

Now let's assemble the result. We have

$$\frac{\partial \ell}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{a}^{(l-1)}\right)^{\top}.$$

For the last layer, we have $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$, so

$$\frac{\partial \ell}{\partial W^{(L)}} = \left(\hat{\mathbf{y}} - \mathbf{y}\right) \left(\mathbf{a}^{(L-1)}\right)^{\top}.$$
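A corresponding check of the outer-product formula, here for the last layer of a tiny assumed network:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

d_prev, K = 4, 3
a_prev = np.random.randn(d_prev)              # a^(L-1), treated as a fixed input
W_last = np.random.randn(K, d_prev)           # W^(L)
y = np.eye(K)[0]                              # one-hot label

loss = lambda W: -np.sum(y * np.log(softmax(W @ a_prev)))

# delta^(L) = y_hat - y, and the claimed gradient is the outer product delta a^T.
delta = softmax(W_last @ a_prev) - y
grad_claimed = np.outer(delta, a_prev)        # shape [K, d_prev], same as W^(L)

# Element-wise central finite-difference check.
eps = 1e-5
grad_fd = np.zeros_like(W_last)
for i in range(K):
    for j in range(d_prev):
        E = np.zeros_like(W_last)
        E[i, j] = eps
        grad_fd[i, j] = (loss(W_last + E) - loss(W_last - E)) / (2 * eps)

print(np.allclose(grad_fd, grad_claimed, atol=1e-6))      # True
```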
Back propagation through layers
Now the only problem left is to calculate $\boldsymbol{\delta}^{(l)}$ for the remaining layers, given $\boldsymbol{\delta}^{(l+1)}$. Element-wise,

$$\delta_j^{(l)} = \frac{\partial \ell}{\partial z_j^{(l)}} = \sum_{i} \frac{\partial \ell}{\partial z_i^{(l+1)}} \frac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}} = \sum_{i} \delta_i^{(l+1)} \frac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}}.$$

The second step is because $z_j^{(l)}$ affects the loss only through every element of $\mathbf{z}^{(l+1)}$, which is exactly the multivariable chain rule.
Since $z_i^{(l+1)} = \sum_j w_{ij}^{(l+1)} f\left(z_j^{(l)}\right)$, we have $\frac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}} = w_{ij}^{(l+1)} f'\left(z_j^{(l)}\right)$, and therefore

$$\delta_j^{(l)} = f'\left(z_j^{(l)}\right) \sum_{i} \delta_i^{(l+1)} w_{ij}^{(l+1)}.$$

Now let's assemble the result. We have

$$\boldsymbol{\delta}^{(l)} = \left( \left(W^{(l+1)}\right)^{\top} \boldsymbol{\delta}^{(l+1)} \right) \odot f'\left(\mathbf{z}^{(l)}\right),$$

where $\odot$ denotes element-wise multiplication.
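The following sketch verifies this one-step recursion numerically for an assumed `tanh` activation, by comparing it with finite differences of the loss with respect to $\mathbf{z}^{(l)}$:

```python
import numpy as np

f, d_f = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # assumed activation / derivative

def softmax(z):
    z = z - np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

d_l, K = 6, 3
z_l = np.random.randn(d_l)                 # z^(l)
W_next = np.random.randn(K, d_l)           # W^(l+1)
y = np.eye(K)[0]

# Loss as a function of z^(l): z^(l+1) = W^(l+1) f(z^(l)), then softmax + CE.
loss = lambda z: -np.sum(y * np.log(softmax(W_next @ f(z))))

# Backprop one layer: delta^(l+1) = y_hat - y, then the assembled recursion.
delta_next = softmax(W_next @ f(z_l)) - y
delta_l = (W_next.T @ delta_next) * d_f(z_l)

# Central finite-difference check of d loss / d z^(l).
eps = 1e-5
fd = np.array([(loss(z_l + eps * e) - loss(z_l - eps * e)) / (2 * eps)
               for e in np.eye(d_l)])
print(np.allclose(fd, delta_l, atol=1e-6))  # True
```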
A batch of data
We can extend the above calculation to a batch of data. By convention, one example is represented as a column vector in the above discussion, while in deep learning programming we put one example in each row of a matrix. We have $X \in \mathbb{R}^{N \times d}$ and $Y \in \mathbb{R}^{N \times K}$, and the forward pass becomes

$$A^{(0)} = X, \qquad Z^{(l)} = A^{(l-1)} \left(W^{(l)}\right)^{\top}, \qquad A^{(l)} = f\left(Z^{(l)}\right), \qquad \hat{Y} = \mathrm{softmax}\left(Z^{(L)}\right) \ \text{(row-wise)}.$$

Then, the loss is

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{CE}\left(\mathbf{y}_n, \hat{\mathbf{y}}_n\right),$$

where $N$ is the batch size and $\mathbf{y}_n$, $\hat{\mathbf{y}}_n$ are the $n$-th rows of $Y$ and $\hat{Y}$.
Note that since the above discussion for one example was carried out on single elements, the deduction for the matrix case with a batch of data is similar.
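A one-line sanity check that the row convention $Z^{(l)} = A^{(l-1)} \left(W^{(l)}\right)^{\top}$ reproduces the column-vector computation $W^{(l)} \mathbf{a}^{(l-1)}$ for each example (sizes assumed for illustration):

```python
import numpy as np

N, d_in, d_out = 5, 4, 3
A = np.random.randn(N, d_in)              # N examples, one per row
W = np.random.randn(d_out, d_in)          # same weight shape as before

Z = A @ W.T                               # batch, row convention: shape [N, d_out]
z_rows = np.stack([W @ a for a in A])     # column convention, one example at a time
print(np.allclose(Z, z_rows))             # True
```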
Loss and softmax
For the loss and softmax, each row of $Z^{(L)}$ only affects the loss term of its own example, so applying the single-example result row by row (and keeping the $\frac{1}{N}$ factor from the mean) gives

$$\Delta^{(L)} = \frac{\partial \mathcal{L}}{\partial Z^{(L)}} = \frac{1}{N} \left( \hat{Y} - Y \right).$$
Gradients for the weight
For the gradients of the weights, define $\Delta^{(l)} = \frac{\partial \mathcal{L}}{\partial Z^{(l)}}$, whose $n$-th row plays the role of $\boldsymbol{\delta}^{(l)}$ for the $n$-th example. The batch gradient is equivalent to summing the per-example outer products $\left(\Delta^{(l)}_{n,:}\right)^{\top} A^{(l-1)}_{n,:}$ over the batch. After assembly, we have

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \left(\Delta^{(l)}\right)^{\top} A^{(l-1)}.$$

A quick shape check to verify the result: $\Delta^{(l)}$ has shape $[N, d_l]$ and $A^{(l-1)}$ has shape $[N, d_{l-1}]$, so $\left(\Delta^{(l)}\right)^{\top} A^{(l-1)}$ has shape $[d_l, d_{l-1}]$, the same as $W^{(l)}$.
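In code, the batch formula is just the per-example outer products summed, which a short check (with assumed sizes) makes explicit:

```python
import numpy as np

N, d_l, d_prev = 5, 3, 4
Delta = np.random.randn(N, d_l)           # Delta^(l), one row per example
A_prev = np.random.randn(N, d_prev)       # A^(l-1), one row per example

grad_batch = Delta.T @ A_prev             # shape [d_l, d_prev], same as W^(l)
grad_sum = sum(np.outer(Delta[n], A_prev[n]) for n in range(N))
print(grad_batch.shape, np.allclose(grad_batch, grad_sum))  # (3, 4) True
```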
BP through layers
Finally, for the update from $\Delta^{(l+1)}$ to $\Delta^{(l)}$, applying the single-example recursion row by row gives

$$\Delta^{(l)} = \left( \Delta^{(l+1)} W^{(l+1)} \right) \odot f'\left(Z^{(l)}\right).$$
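And a similar check that the batch recursion matches the single-example recursion applied row by row (again with assumed sizes and a `tanh`-style derivative):

```python
import numpy as np

d_f = lambda z: 1.0 - np.tanh(z) ** 2     # assumed activation derivative

N, d_l, d_next = 5, 6, 4
Z_l = np.random.randn(N, d_l)             # Z^(l)
W_next = np.random.randn(d_next, d_l)     # W^(l+1)
Delta_next = np.random.randn(N, d_next)   # Delta^(l+1)

Delta_l = (Delta_next @ W_next) * d_f(Z_l)                                    # batch rule
rows = np.stack([(W_next.T @ Delta_next[n]) * d_f(Z_l[n]) for n in range(N)]) # per example
print(np.allclose(Delta_l, rows))          # True
```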
Implementation
Now we have the derivation of the backpropagation algorithm:

$$\Delta^{(L)} = \frac{1}{N}\left(\hat{Y} - Y\right), \qquad \Delta^{(l)} = \left( \Delta^{(l+1)} W^{(l+1)} \right) \odot f'\left(Z^{(l)}\right), \qquad \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \left(\Delta^{(l)}\right)^{\top} A^{(l-1)}.$$
Then we can implement the backward pass easily. The pseudo code for a multilayer perceptron without bias is as follows. Note here `f` is the activation function, `d_f` is its derivative, and the input `X` takes the place of `A[l-1]` at the first layer.
```python
# NumPy pseudo code (it mirrors what PyTorch's autograd computes).
# W is a list of weight matrices; W[l] has shape [d_l, d_{l-1}].
import numpy as np

def forward_pass(X, Y, W):
    L = len(W)
    A, Z = [None] * L, [None] * L
    A_prev = X
    for l in range(L - 1):
        Z[l] = A_prev @ W[l].T            # Z^(l) = A^(l-1) (W^(l))^T
        A[l] = f(Z[l])
        A_prev = A[l]
    Z[L - 1] = A[L - 2] @ W[L - 1].T      # logits
    A[L - 1] = softmax(Z[L - 1])
    loss = -np.sum(Y * np.log(A[L - 1])) / Y.shape[0]   # mean cross entropy
    return loss, A, Z

def backward_pass(X, Y, A, Z, W):
    L = len(W)
    Delta, W_g = [None] * L, [None] * L
    Delta[L - 1] = (A[L - 1] - Y) / Y.shape[0]           # Delta^(L) = (Y_hat - Y) / N
    W_g[L - 1] = Delta[L - 1].T @ A[L - 2]
    for l in range(L - 2, -1, -1):
        Delta[l] = Delta[l + 1] @ W[l + 1] * d_f(Z[l])   # Delta^(l)
        W_g[l] = Delta[l].T @ (A[l - 1] if l > 0 else X) # gradient of W^(l)
    return W_g

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(X, Y, A, Z, W)        # what PyTorch's loss.backward() computes
```
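To sanity-check the pseudo code above, the following sketch plugs in concrete `f`, `d_f`, and `softmax` (a sigmoid activation and a row-wise softmax, chosen just for this check), runs the forward and backward passes on random data, and compares one entry of the returned gradient with a central finite-difference estimate. All sizes, the seed, and the activation are assumptions for illustration.

```python
import numpy as np

# Concrete pieces assumed for this check.
def f(z):
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid

def d_f(z):
    s = f(z)
    return s * (1.0 - s)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

np.random.seed(0)
N, d, h, K = 8, 5, 6, 3                                  # batch, input, hidden, classes
X = np.random.randn(N, d)
Y = np.eye(K)[np.random.randint(K, size=N)]              # one-hot labels
W = [np.random.randn(h, d), np.random.randn(K, h)]

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(X, Y, A, Z, W)

# Central finite-difference check on a single entry of W[0].
eps = 1e-5
W[0][1, 2] += eps
loss_plus, _, _ = forward_pass(X, Y, W)
W[0][1, 2] -= 2 * eps
loss_minus, _, _ = forward_pass(X, Y, W)
W[0][1, 2] += eps                                        # restore the weight
fd = (loss_plus - loss_minus) / (2 * eps)
print(np.isclose(fd, W_g[0][1, 2], atol=1e-6))           # True
```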