Outline

There are several ways to derive backpropagation (BP) for a multi-layer perceptron (MLP). First, notice that we want to calculate $\frac{\partial L}{\partial W^{(l)}}$ for each layer, which is a scalar-to-matrix derivative. We can use the following methods:

  • Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly with matrix-to-matrix gradients. (not adopted here, because matrix-to-matrix gradients are not easy to calculate)
  • Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly while avoiding vector-to-matrix gradients. (not adopted here, because it is still hard)
  • Calculate $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ and then assemble the results into a matrix. (adopted here)

For the adopted method, although the scalar results $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ are enough to determine the whole gradient, and you could write a for-loop to complete the update, that would not be efficient. Modern accelerators such as GPUs can greatly speed up matrix multiplication, so we still need to assemble the scalar results into a matrix.

We will first derive the gradient for a single example. Since the SGD algorithm works on a batch of examples, we then extend the result to a batch.

Preliminaries

To make the calculation easier, we will first introduce three points that you should be aware of: denominator layout, multivariable chain rule, and the assembly of a matrix.

First, in the deep learning community, derivatives use the denominator layout by default, which means a scalar-to-matrix gradient results in a matrix of the same shape as the original matrix. If the result were instead the transpose of the original matrix, that would be the numerator layout. You can learn more about the denominator layout from Wikipedia.

Next, we should know the multivariable chain rule. If $x = x(t)$, $y = y(t)$, and $z = f(x, y)$, then we have $\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$. Here is an article on the multivariable chain rule. Since we calculate the gradient in a scalar way, a function that accepts a vector or matrix becomes a multivariable function. For example, $f(\mathbf{x})$ becomes $f(x_1, x_2, \dots, x_n)$. Thus, in the following derivation, we should always use the multivariable chain rule.
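
As a small sanity check of the rule, take $z = xy$ with $x = t^2$ and $y = t^3$:

$$
\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt} = y \cdot 2t + x \cdot 3t^2 = 2t^4 + 3t^4 = 5t^4,
$$

which matches differentiating $z = t^5$ directly.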

Finally, we need to know how to assemble a vector or matrix. For a matrix $W$, we can write $W = [W_{i,j}]_{i,j}$, where the outer indices $i$ and $j$ mean that we iterate over every row and column. Similarly, for a vector we have $\mathbf{v} = [v_i]_i$. Assembly can be seen as the reverse of matrix multiplication. For example, the outer product of two column vectors is $\mathbf{x}\mathbf{y}^\top = [x_i y_j]_{i,j}$. Thus, if we arrive at a scalar result $W_{i,j} = x_i y_j$, we can assemble it into the matrix $W = \mathbf{x}\mathbf{y}^\top$.
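
As a minimal NumPy illustration (with made-up numbers), the element-wise result $W_{i,j} = x_i y_j$ assembles into the outer product $\mathbf{x}\mathbf{y}^\top$:

import numpy as np

x = np.array([1.0, 2.0, 3.0])                              # column vector x
y = np.array([4.0, 5.0])                                   # column vector y
W_elementwise = np.array([[x[i] * y[j] for j in range(2)]  # W_{i,j} = x_i y_j
                          for i in range(3)])
W_assembled = np.outer(x, y)                               # x y^T
print(np.allclose(W_elementwise, W_assembled))             # True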

Definition

Scalars: $x, y, \dots$
Vectors: $\mathbf{x}, \mathbf{y}, \dots$
Matrices: $X, Y, \dots$
Subscript: $x_i$ means the $i$-th element of $\mathbf{x}$, which is a scalar.
$\mathbb{1}_{ij}$ equals $1$ if $i = j$ and $0$ otherwise.

Input: $\mathbf{x}$, shape $[n, 1]$
Label: $\mathbf{y}$, shape $[c, 1]$, one-hot
Number of layers: $L$
Number of classes: $c$
Linear transformation: $W\mathbf{x} + \mathbf{b}$
Weight at layer $l$: $W^{(l)}$
Shape of $W^{(l)}$:

  • First layer: $W^{(1)}$, shape $[m^{(1)}, m^{(0)} = n]$
  • Hidden layers (from $2$ to $L-1$): $W^{(l)}$, shape $[m^{(l)}, m^{(l-1)}]$
  • Last layer: $W^{(L)}$, shape $[c, m^{(L-1)}]$

Activation function: $f$
Activation at layer $l$: $\mathbf{a}^{(l)}$

  • Input: $\mathbf{a}^{(0)}$, shape $[n, 1]$
  • Hidden activations (from $1$ to $L-1$): $\mathbf{a}^{(l)}$, shape $[m^{(l)}, 1]$
  • Output of the last layer (softmax probabilities): $\mathbf{a}^{(L)}$, shape $[c, 1]$

Forward Pass

$$
\begin{aligned}
\mathbf{a}^{(0)} &= \mathbf{x} \\
\mathbf{z}^{(1)} &= W^{(1)}\mathbf{a}^{(0)} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(1)} &= f(\mathbf{z}^{(1)}) \\
\mathbf{z}^{(l)} &= W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \\
\mathbf{a}^{(l)} &= f(\mathbf{z}^{(l)}) \\
\mathbf{z}^{(L)} &= W^{(L)}\mathbf{a}^{(L-1)} + \mathbf{b}^{(L)} \\
\mathbf{a}^{(L)} &= \mathrm{softmax}(\mathbf{z}^{(L)}) \\
L &= \mathrm{CE}(\mathbf{a}^{(L)}, \mathbf{y})
\end{aligned}
$$

where the cross-entropy loss is $L = \mathrm{CE}(\mathbf{a}^{(L)}, \mathbf{y}) = -\sum_{i=1}^{c} y_i \log\big(a_i^{(L)}\big)$.
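
As a concrete illustration with made-up numbers (the helpers `softmax` and `cross_entropy` below are local to this sketch, not part of the derivation):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(a, y):
    return -np.sum(y * np.log(a))    # L = -sum_i y_i log(a_i)

z_L = np.array([2.0, 0.5, -1.0])     # logits z^(L) for c = 3 classes
y = np.array([1.0, 0.0, 0.0])        # one-hot label
a_L = softmax(z_L)                   # a^(L), a probability vector summing to 1
print(a_L, cross_entropy(a_L, y))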

We want to calculate the gradient of $L$ with respect to $W^{(l)}$ and $\mathbf{b}^{(l)}$. Here we only discuss the gradient with respect to $W^{(l)}$; the case of $\mathbf{b}^{(l)}$ is similar.

The last layer

Since the last layer is different from the other layers, we calculate the gradient for it separately.

In the forward pass, $a_i^{(L)} = \frac{e^{z_i^{(L)}}}{\sum_{k=1}^{c} e^{z_k^{(L)}}}$ is the normalized probability output by the softmax layer. A high-school-level calculation of the gradient of softmax leads to $\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}} = a_i^{(L)}\big(\mathbb{1}_{ij} - a_j^{(L)}\big)$.
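
For completeness, here is that calculation written out with the quotient rule. For $i = j$:

$$
\frac{\partial a_i^{(L)}}{\partial z_i^{(L)}}
= \frac{e^{z_i^{(L)}} \sum_{k} e^{z_k^{(L)}} - e^{z_i^{(L)}} e^{z_i^{(L)}}}{\big(\sum_{k} e^{z_k^{(L)}}\big)^2}
= a_i^{(L)}\big(1 - a_i^{(L)}\big),
$$

and for $i \neq j$:

$$
\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}
= \frac{- e^{z_i^{(L)}} e^{z_j^{(L)}}}{\big(\sum_{k} e^{z_k^{(L)}}\big)^2}
= -a_i^{(L)} a_j^{(L)}.
$$

Both cases are covered by $\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}} = a_i^{(L)}\big(\mathbb{1}_{ij} - a_j^{(L)}\big)$.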

To calculate the gradient $\frac{\partial L}{\partial z_j^{(L)}}$, we use the chain rule by introducing $\mathbf{a}^{(L)}$. An important point is that when we work in scalar form, most functions we meet are multivariable functions, so we need the multivariable chain rule. For example, $L$ is the result of the cross-entropy function applied to $a_1^{(L)}, a_2^{(L)}, \dots, a_c^{(L)}$. According to the multivariable chain rule, we have $\frac{\partial L}{\partial z_j^{(L)}} = \sum_{i=1}^{c} \frac{\partial L}{\partial a_i^{(L)}} \frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}$.

Thus, we have

$$
\frac{\partial L}{\partial z_j^{(L)}}
= \sum_{i=1}^{c} \frac{\partial L}{\partial a_i^{(L)}} \frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}
= \sum_{i=1}^{c} \left(-\frac{y_i}{a_i^{(L)}}\right) \frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}
= \sum_{i=1}^{c} -y_i\big(\mathbb{1}_{ij} - a_j^{(L)}\big)
= -\sum_{i=1}^{c} y_i \mathbb{1}_{ij} + a_j^{(L)} \sum_{i=1}^{c} y_i
= a_j^{(L)} - y_j,
$$

where the last step uses $\sum_{i=1}^{c} y_i = 1$ for a one-hot label.

Then we assemble $\frac{\partial L}{\partial z_j^{(L)}}$ into a vector: $\frac{\partial L}{\partial \mathbf{z}^{(L)}} = \mathbf{a}^{(L)} - \mathbf{y}$. The detailed assembly process is

$$
\frac{\partial L}{\partial \mathbf{z}^{(L)}}
= \left[\frac{\partial L}{\partial z_j^{(L)}}\right]_{j=1}^{c}
= \left[a_j^{(L)} - y_j\right]_{j=1}^{c}
= \left[a_j^{(L)}\right]_{j=1}^{c} - \left[y_j\right]_{j=1}^{c}
= \mathbf{a}^{(L)} - \mathbf{y}.
$$
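
A quick numerical sanity check of this result (a sketch with made-up values, not part of the original derivation): compare the analytic gradient $\mathbf{a}^{(L)} - \mathbf{y}$ with finite differences of the cross-entropy loss.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # stabilized softmax
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))   # cross entropy on the logits

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.eye(5)[2]                             # one-hot label for class 2
analytic = softmax(z) - y                    # delta^(L) = a^(L) - y
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(5):
    e_j = np.zeros_like(z)
    e_j[j] = eps
    numeric[j] = (loss(z + e_j, y) - loss(z - e_j, y)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))   # True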

Gradients for weights

For ease of the chain-rule calculation, we define $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$, which is a column vector. For the last layer, $\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}$.

If we can calculate $\boldsymbol{\delta}^{(l)}$ for all $l$, then we can calculate $\frac{\partial L}{\partial W^{(l)}}$ for all $l$. The calculation uses the multivariable chain rule: since we want the chain rule to connect $L$ and $W^{(l)}$ through $\mathbf{z}^{(l)}$, we write $\frac{\partial L}{\partial W_{i,j}^{(l)}} = \sum_k \frac{\partial L}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}}$.

Now let's recall the matrix multiplication in $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$: we have $z_i^{(l)} = \sum_{j=1}^{m^{(l-1)}} W_{i,j}^{(l)} a_j^{(l-1)} + b_i^{(l)}$. Then $\frac{\partial z_i^{(l)}}{\partial W_{i,j}^{(l)}} = a_j^{(l-1)}$, and $\frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}} = 0$ for $k \neq i$ (namely, $W_{i,j}^{(l)}$ is only involved in the calculation of $z_i^{(l)}$). Thus, we have

$$
\frac{\partial L}{\partial W_{i,j}^{(l)}} = \frac{\partial L}{\partial z_i^{(l)}} \frac{\partial z_i^{(l)}}{\partial W_{i,j}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}
$$

Now let's assemble the result. We have $\frac{\partial L}{\partial W^{(l)}} = \left[\frac{\partial L}{\partial W_{i,j}^{(l)}}\right]_{i,j} = \left[\delta_i^{(l)} a_j^{(l-1)}\right]_{i,j} = \boldsymbol{\delta}^{(l)} \mathbf{a}^{(l-1)\top}$.

For the last layer, we have $\frac{\partial L}{\partial W_{i,j}^{(L)}} = \big(a_i^{(L)} - y_i\big)\, a_j^{(L-1)}$, and $\frac{\partial L}{\partial W^{(L)}} = \big(\mathbf{a}^{(L)} - \mathbf{y}\big)\mathbf{a}^{(L-1)\top}$.

Back propagation through layers

Now the only problem left is to calculate $\boldsymbol{\delta}^{(l)}$ for every layer. We can use the chain rule to compute $\boldsymbol{\delta}^{(l-1)}$ from $\boldsymbol{\delta}^{(l)}$:

$$
\delta_j^{(l-1)} = \frac{\partial L}{\partial z_j^{(l-1)}}
= \sum_k \frac{\partial L}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial z_j^{(l-1)}}
= \sum_k \frac{\partial L}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}} \frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}
$$

The second step is because $z_j^{(l-1)}$ contributes to every $z_k^{(l)}$ through the linear transformation. The third step is because $z_j^{(l-1)}$ only contributes to $a_j^{(l-1)}$ through the element-wise nonlinear transformation.

Since $a_j^{(l-1)} = f\big(z_j^{(l-1)}\big)$, we have $\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}} = f'\big(z_j^{(l-1)}\big)$. Since $\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$, from the matrix multiplication we have $\frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}} = W_{k,j}^{(l)}$. Thus, we have:

$$
\delta_j^{(l-1)} = \sum_k \delta_k^{(l)} W_{k,j}^{(l)} f'\big(z_j^{(l-1)}\big)
$$

Now let’s assemble the result. We have

$$
\boldsymbol{\delta}^{(l-1)}
= \left[\delta_j^{(l-1)}\right]_{j=1}^{m^{(l-1)}}
= \left[\sum_k \delta_k^{(l)} W_{k,j}^{(l)} f'\big(z_j^{(l-1)}\big)\right]_{j=1}^{m^{(l-1)}}
= \left[\sum_k \delta_k^{(l)} W_{k,j}^{(l)}\right]_{j=1}^{m^{(l-1)}} \odot \left[f'\big(z_j^{(l-1)}\big)\right]_{j=1}^{m^{(l-1)}}
= \left[W_{:,j}^{(l)\top} \boldsymbol{\delta}^{(l)}\right]_{j=1}^{m^{(l-1)}} \odot f'\big(\mathbf{z}^{(l-1)}\big)
= W^{(l)\top} \boldsymbol{\delta}^{(l)} \odot f'\big(\mathbf{z}^{(l-1)}\big)
$$

where $\odot$ denotes the element-wise (Hadamard) product.
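
A minimal NumPy sketch (with made-up shapes and ReLU as an example activation; neither is prescribed by the derivation) confirming that the element-wise sum and the assembled form agree:

import numpy as np

rng = np.random.default_rng(0)
m_out, m_in = 4, 3
W = rng.normal(size=(m_out, m_in))           # W^(l), shape [m^(l), m^(l-1)]
delta = rng.normal(size=m_out)               # delta^(l)
z_prev = rng.normal(size=m_in)               # z^(l-1)
d_f = lambda z: (z > 0).astype(float)        # f'(z) for ReLU, used here only as an example

# Element-wise formula: delta_j^(l-1) = sum_k delta_k^(l) W_{k,j}^(l) f'(z_j^(l-1))
elementwise = np.array([sum(delta[k] * W[k, j] for k in range(m_out)) * d_f(z_prev)[j]
                        for j in range(m_in)])
# Assembled form: W^(l)^T delta^(l), multiplied element-wise by f'(z^(l-1))
assembled = W.T @ delta * d_f(z_prev)
print(np.allclose(elementwise, assembled))   # True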

A batch of data

We can extend the above calculation to a batch of data. By convention, one example is represented as a column vector in the discussion above, while in deep learning programming we put one example in each row of a matrix. We have $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_b]^\top$ (shape $[b, n]$), where each $\mathbf{x}_i$ is a column vector, and similarly $Y = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_b]^\top$ (shape $[b, c]$).

Then the loss is $L = \frac{1}{b}\sum_{i=1}^{b} L(\mathbf{x}_i, \mathbf{y}_i)$, and the vectors $\mathbf{a}^{(l)}, \mathbf{z}^{(l)}, \boldsymbol{\delta}^{(l)}$ become matrices $A^{(l)}, Z^{(l)}, \Delta^{(l)}$ respectively (all of shape $[b, m^{(l)}]$). For the linear transformation, we have

$$
Z^{(l)} = A^{(l-1)} W^{(l)\top} + B^{(l)}
$$

where $B^{(l)}$ stacks $b$ copies of $\mathbf{b}^{(l)\top}$ as its rows. For the nonlinear transformation, we have $A^{(l)} = f\big(Z^{(l)}\big)$. (You can also define the $W$ matrix to be the transpose of the original definition to avoid the transpose in the linear transformation.)
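
In NumPy code, $B^{(l)}$ is usually not materialized; broadcasting adds $\mathbf{b}^{(l)}$ to every row. A minimal sketch with made-up sizes:

import numpy as np

rng = np.random.default_rng(0)
b_size, m_in, m_out = 8, 5, 3
A_prev = rng.normal(size=(b_size, m_in))     # A^(l-1), one example per row
W = rng.normal(size=(m_out, m_in))           # W^(l), shape [m^(l), m^(l-1)]
b = rng.normal(size=m_out)                   # b^(l)

Z = A_prev @ W.T + b                         # Z^(l) = A^(l-1) W^(l)^T + B^(l); b is broadcast over rows
print(Z.shape)                               # (8, 3), i.e. [b, m^(l)]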

Note that since the single-example discussion above was carried out element by element, the derivation for the matrix case with a batch of data is similar.

Loss and softmax

For the loss and softmax, $\frac{\partial L}{\partial Z_{i,j}^{(L)}} = \frac{1}{b}\big(A_{i,j}^{(L)} - Y_{i,j}\big)$. The assembly process is straightforward and leads to $\frac{\partial L}{\partial Z^{(L)}} = \frac{1}{b}\big(A^{(L)} - Y\big)$.

Gradients for the weights

For the gradients of the weights, the linear transformation is equivalent to $Z^{(l)\top} = W^{(l)} A^{(l-1)\top} + B^{(l)\top}$, which is closer to the vector form. Since the weight is involved in the calculation for every example in the batch, using the multivariable chain rule we have

$$
\frac{\partial L}{\partial W_{i,j}^{(l)}}
= \sum_{k=1}^{b} \frac{\partial L}{\partial Z_{k,i}^{(l)}} \frac{\partial Z_{k,i}^{(l)}}{\partial W_{i,j}^{(l)}}
= \sum_{k=1}^{b} \Delta_{k,i}^{(l)} A_{k,j}^{(l-1)}
= \sum_{k=1}^{b} \big(\Delta^{(l)\top}\big)_{i,k} A_{k,j}^{(l-1)}
$$

After assembly, we have

$$
\frac{\partial L}{\partial W^{(l)}} = \Delta^{(l)\top} A^{(l-1)}
$$

A quick shape check verifies the result: $\frac{\partial L}{\partial W^{(l)}}$ should have the same shape as $W^{(l)}$, which is $[m^{(l)}, m^{(l-1)}]$. $\Delta^{(l)\top}$ is of shape $[m^{(l)}, b]$ and $A^{(l-1)}$ is of shape $[b, m^{(l-1)}]$, so $\Delta^{(l)\top} A^{(l-1)}$ is of shape $[m^{(l)}, m^{(l-1)}]$, the same as $W^{(l)}$.

BP through layers

Finally, for the update from $\Delta^{(l)}$ to $\Delta^{(l-1)}$, since the examples are independent of each other, applying the vector-form result to each example gives $\Delta^{(l-1)\top} = \big(W^{(l)\top} \Delta^{(l)\top}\big) \odot f'\big(Z^{(l-1)\top}\big)$, which means $\Delta^{(l-1)} = \big(\Delta^{(l)} W^{(l)}\big) \odot f'\big(Z^{(l-1)}\big)$. This is a direct extension of the vector form.

Implementation

Now we have the derivation of the backpropagation algorithm:

$$
\begin{aligned}
\Delta^{(L)} &= \frac{1}{b}\big(A^{(L)} - Y\big) \\
\Delta^{(l)} &= \big(\Delta^{(l+1)} W^{(l+1)}\big) \odot f'\big(Z^{(l)}\big) \\
\frac{\partial L}{\partial W^{(l)}} &= \Delta^{(l)\top} A^{(l-1)}
\end{aligned}
$$

Then we can implement the backward pass easily. NumPy-style code for a multilayer perceptron without bias terms is as follows. Note that the W matrices strictly follow the definition above and are consistent with PyTorch's nn.Linear (weight shape [out_features, in_features]).

# With a PyTorch-style API: W is a list of weight matrices,
# W[l] has shape [fan_out, fan_in], as in nn.Linear.
# The activation f, its derivative d_f, and softmax are assumed to be defined.

import numpy as np

def forward_pass(X, Y, W):
    n_layers = len(W)
    A, Z = [None] * n_layers, [None] * n_layers
    A_prev = X                                        # A^(0) = X
    for l in range(n_layers - 1):
        Z[l] = A_prev @ W[l].T                        # Z^(l) = A^(l-1) W^(l)^T
        A[l] = f(Z[l])                                # A^(l) = f(Z^(l))
        A_prev = A[l]
    Z[-1] = A_prev @ W[-1].T                          # logits of the last layer
    A[-1] = softmax(Z[-1])                            # A^(L) = softmax(Z^(L)), row-wise
    loss = -np.sum(Y * np.log(A[-1])) / Y.shape[0]    # batch-averaged cross entropy
    return loss, A, Z

def backward_pass(X, Y, A, Z, W):
    n_layers = len(W)
    A_in = [X] + A[:-1]                               # A_in[l] is the input of layer l, i.e. A^(l-1)
    Delta = [None] * n_layers
    W_g = [None] * n_layers
    Delta[-1] = (A[-1] - Y) / Y.shape[0]              # Delta^(L) = (A^(L) - Y) / b
    for l in range(n_layers - 2, -1, -1):
        Delta[l] = Delta[l + 1] @ W[l + 1] * d_f(Z[l])   # Delta^(l) = (Delta^(l+1) W^(l+1)) * f'(Z^(l))
    for l in range(n_layers):
        W_g[l] = Delta[l].T @ A_in[l]                 # dL/dW^(l) = Delta^(l)^T A^(l-1)
    return W_g

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(X, Y, A, Z, W)                    # the analogue of loss.backward() in PyTorch
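
To actually run the code above, one possible set of helper definitions and toy data is shown below (ReLU, the layer sizes, and the random data are illustrative assumptions, not part of the derivation):

import numpy as np

def f(z):                                        # example activation: ReLU
    return np.maximum(z, 0.0)

def d_f(z):                                      # derivative of ReLU
    return (z > 0).astype(z.dtype)

def softmax(z):                                  # row-wise, numerically stabilized
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
b, n, h, c = 16, 10, 32, 4                       # batch, input, hidden, classes
X = rng.normal(size=(b, n))
Y = np.eye(c)[rng.integers(0, c, size=b)]        # one-hot labels
W = [0.1 * rng.normal(size=(h, n)),              # W^(1): [m^(1), n]
     0.1 * rng.normal(size=(c, h))]              # W^(2): [c, m^(1)]

loss, A, Z = forward_pass(X, Y, W)               # uses forward_pass/backward_pass defined above
W_g = backward_pass(X, Y, A, Z, W)
print(loss, [g.shape for g in W_g])              # gradient shapes match the weight shapes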