Outline

There are several ways to derive backpropagation (BP) for a multi-layer perceptron (MLP). First, notice that we want to calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ for each layer, which is a scalar-to-matrix derivative. We can use the following methods:

  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ directly with matrix-to-matrix gradients. (not adopted here, because matrix-to-matrix gradients are hard to calculate)
  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ directly while avoiding vector-to-matrix gradients. (not adopted here, because it is still hard)
  • Calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}$ element by element and then assemble the results into a matrix. (adopted here)

For the adopted method, the element-wise results $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}$ already determine the whole gradient, and you could complete the update with a for-loop, but that is inefficient. Modern accelerators such as GPUs are fast at matrix multiplication, so we still need to assemble the scalar results into a matrix.

We will first derive the gradient for a single example. Since the SGD algorithm uses a batch of examples, we then extend the result to a batch.

Preliminaries

To make the calculation easier, we will first introduce three points that you should be aware of: denominator layout, multivariable chain rule, and the assembly of a matrix.

First, in the deep learning community, derivatives use the denominator layout by default, which means a scalar-to-matrix gradient results in a matrix of the same shape as the original matrix. (If the result had the shape of the transposed matrix, it would be the numerator layout.) For example, for $y=\mathbf{a}^\top\mathbf{x}$, the denominator layout gives $\frac{\partial y}{\partial \mathbf{x}}=\mathbf{a}$, a column vector of the same shape as $\mathbf{x}$. You can learn more about denominator layout from Wiki.

Next, we should know the multivariable chain rule. If $x=x(t),y=y(t),z=f(x,y)$, then we have $\frac{\partial z}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$. Here is an article on the multivariable chain rule. Since we calculate the gradient in a scalar way, a function that accepts a vector or matrix becomes a multivariable function. For example, $f(\mathbf{x})$ becomes $f(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n)$. Thus, in the following derivation, we should always use the multivariable chain rule.

Finally, we need to know how to assemble a vector or matrix. For a matrix $\mathbf{W}$, we can write $\mathbf{W}=[\mathbf{W}_{i,j}]_{i,j}$, where the outer indices $i$ and $j$ mean that we iterate over each row and column. Similarly, for a vector, we have $\mathbf{v}=[\mathbf{v}_i]_i$. Assembly can be seen as the reverse of matrix multiplication. For example, the outer product of two column vectors is $\mathbf{xy}^\top=[\mathbf{x}_i\mathbf{y}_j]_{i,j}$. Thus, if we obtain a scalar result $\mathbf{W}_{i,j}=\mathbf{x}_i\mathbf{y}_j$, we can assemble it into the matrix $\mathbf{W}=\mathbf{x}\mathbf{y}^\top$.
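As a quick sanity check of this assembly rule, the element-wise construction $\mathbf{W}_{i,j}=\mathbf{x}_i\mathbf{y}_j$ does reproduce the outer product (a minimal NumPy sketch; the vectors are arbitrary):

import numpy as np

# assemble W[i, j] = x[i] * y[j] element by element
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0])
W_assembled = np.empty((x.size, y.size))
for i in range(x.size):
    for j in range(y.size):
        W_assembled[i, j] = x[i] * y[j]

# ... which is exactly the outer product x y^T
assert np.allclose(W_assembled, np.outer(x, y))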

Definition

scalar: $x, y, \cdots$
vector: $\mathbf{x}, \mathbf{y}, \cdots$
matrix: $\mathbf{X}, \mathbf{Y}, \cdots$
Subscript: $\mathbf{x}_i$ means the $i$th element of $\mathbf{x}$, which is a scalar.
$\mathbf{1}_{ij}$ equals 1 if $i=j$ and 0 otherwise.

Input: $\mathbf{x}$, shape $[n, 1]$
Label: $\mathbf{y}$, shape $[c, 1]$, one-hot
Number of layers: $L$
Number of classes: $c$
Linear transformation: $\mathbf{Wx+b}$
Weight at layer $l$: $\mathbf{W}^{(l)}$
Shape of $\mathbf{W}$:

  • First layer: $\mathbf{W}^{(1)}$ $[m^{(1)}, m^{(0)}=n]$
  • Hidden layers: (from 2 to $L-1$): $\mathbf{W}^{(l)}$ [$m^{(l)}$, $m^{(l-1)}$]
  • Last layer: $\mathbf{W}^{(L)}$ $[c, m^{(L-1)}]$

Activation function: $f$
Activation at layer $l$: $\mathbf{a}^{(l)}$

  • Input: $\mathbf{a}^{(0)}$ $[n, 1]$
  • Activations: (from 1 to $L-1$): $\mathbf{a}^{(l)}$ $[m^{(l)}, 1]$
  • Output probabilities (softmax of the logits $\mathbf{z}^{(L)}$): $\mathbf{a}^{(L)}$ $[c, 1]$

Forward Pass

$$\begin{align*} \mathbf{a}^{(0)} &= \mathbf{x} \\ \mathbf{z}^{(1)} &= \mathbf{W}^{(1)}\mathbf{a}^{(0)} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(1)} &= f(\mathbf{z}^{(1)}) \\ \vdots & \quad\vdots\\ \mathbf{z}^{(l)} &= \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \\ \mathbf{a}^{(l)} &= f(\mathbf{z}^{(l)}) \\ \vdots & \quad\vdots\\ \mathbf{z}^{(L)} &= \mathbf{W}^{(L)}\mathbf{a}^{(L-1)} + \mathbf{b}^{(L)} \\ \mathbf{a}^{(L)} &= \text{softmax}(\mathbf{z}^{(L)}) \\ \mathcal{L} &=\text{CE}(\mathbf{a}^{(L)},\mathbf{y}) \\ \end{align*} $$

where the cross entropy loss is $\mathcal{L}=\text{CE}(\mathbf{a}^{(L)},\mathbf{y})=-\sum_{i=1}^c \mathbf{y}_i\log(\mathbf{a}_i^{(L)})$.
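To make the shapes concrete, here is a minimal single-example forward pass in NumPy (a sketch; the sigmoid activation, layer sizes, and random weights are arbitrary choices for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n, m1, c = 4, 5, 3                     # input dim, hidden dim, number of classes
x = rng.normal(size=(n, 1))            # input, shape [n, 1]
y = np.array([[0.0], [1.0], [0.0]])    # one-hot label, shape [c, 1]
W1, b1 = rng.normal(size=(m1, n)), np.zeros((m1, 1))
W2, b2 = rng.normal(size=(c, m1)), np.zeros((c, 1))

a0 = x
z1 = W1 @ a0 + b1                      # shape [m1, 1]
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2                      # logits, shape [c, 1]
a2 = softmax(z2)                       # probabilities, shape [c, 1]
loss = -np.sum(y * np.log(a2))         # cross entropy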

We want to calculate the gradient of $\mathcal{L}$ with respect to $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$. Here we only discuss the gradient with respect to $\mathbf{W}^{(l)}$; the case of $\mathbf{b}^{(l)}$ is similar.

The last layer

Since the last layer is different from the other layers, we calculate the gradient for it separately.

In the forward pass, $\mathbf{a}^{(L)}_i=\frac{e^{\mathbf{z}^{(L)}_i}}{\sum_{k=1}^c e^{\mathbf{z}^{(L)}_k}}$ is the normalized probability output by the softmax layer. A routine calculation of the softmax gradient gives $\frac{\partial \mathbf{a}^{(L)}_i}{\partial \mathbf{z}^{(L)}_j}=\mathbf{a}^{(L)}_i(\mathbf{1}_{ij}-\mathbf{a}^{(L)}_j)$.

To calculate the gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}$, we use the chain rule by introducing $\mathbf{a}^{(L)}$. An important point is that in the scalar form, most functions we meet are multivariable functions, so we need the multivariable chain rule. For example, $\mathcal{L}$ is the result of the cross entropy function applied to $\mathbf{a}_1^{(L)}, \mathbf{a}_2^{(L)}, \dots, \mathbf{a}_{c}^{(L)}$. According to the multivariable chain rule, we have $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}=\sum_{i=1}^c\frac{\partial \mathcal{L}}{\partial \mathbf{a}_i^{(L)}}\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}}$.

Thus, we have

$$ \begin{align*} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}} &= \sum_{i=1}^c\frac{\partial \mathcal{L}}{\partial \mathbf{a}_i^{(L)}}\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}} \\ & =\sum_{i=1}^c\left(-\frac{\mathbf{y}_i}{\mathbf{a}_i^{(L)}}\right)\frac{\partial \mathbf{a}_i^{(L)}}{\partial \mathbf{z}_j^{(L)}} \\ &=-\sum_{i=1}^c\mathbf{y}_i(\mathbf{1}_{ij}-\mathbf{a}^{(L)}_j) \\ &=-\sum_{i=1}^c\mathbf{y}_i\mathbf{1}_{ij}+\mathbf{a}_j^{(L)}\sum_{i=1}^c\mathbf{y}_i \\ &=\mathbf{a}^{(L)}_j-\mathbf{y}_j \end{align*} $$

Then we assemble $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}$ into the vector $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}=\mathbf{a}^{(L)}-\mathbf{y}$. The detailed assembly process is $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}=[\frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(L)}}]_{j=1}^{c}=[\mathbf{a}^{(L)}_j-\mathbf{y}_j]_{j=1}^{c}=[\mathbf{a}^{(L)}_j]_{j=1}^{c}-[\mathbf{y}_j]_{j=1}^{c}=\mathbf{a}^{(L)}-\mathbf{y}$.
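The closed form $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}=\mathbf{a}^{(L)}-\mathbf{y}$ is easy to verify with finite differences (a sketch; the logits and label below are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_from_logits(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8])         # logits z^(L)
y = np.array([0.0, 0.0, 1.0])          # one-hot label

analytic = softmax(z) - y              # a^(L) - y

eps = 1e-6
numeric = np.array([
    (loss_from_logits(z + eps * np.eye(3)[j], y) -
     loss_from_logits(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)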

Gradients for weights

We can define $$\mathbf{\delta}^{(l)}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$$ which is a column vector, for the ease of chain rule calculation. For the last layer, $\mathbf{\delta}^{(L)}=\mathbf{a}^{(L)}-\mathbf{y}$.

If we can calculate $\mathbf{\delta}^{(l)}$ for all $l$, then we can calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ for all $l$. The calculation uses the multivariable chain rule. Since we want to connect $\mathcal{L}$ and $\mathbf{W}^{(l)}$ through $\mathbf{z}^{(l)}$, we use the multivariable chain rule $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}=\sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{W}^{(l)}_{i,j}}$.

Now let’s recap the matrix multiplication in $\mathbf{z}^{(l)}= \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$: we have $\mathbf{z}^{(l)}_i=\sum_{j=1}^{m^{(l-1)}} \mathbf{W}^{(l)}_{i,j}\mathbf{a}^{(l-1)}_j+\mathbf{b}^{(l)}_i$. Then $\frac{\partial \mathbf{z}^{(l)}_i}{\partial \mathbf{W}^{(l)}_{i,j}}=\mathbf{a}^{(l-1)}_j$, and $\frac{\partial \mathbf{z}^{(l)}_k}{\partial \mathbf{W}^{(l)}_{i,j}}=0$ for $k\neq i$ (namely, $\mathbf{W}_{i,j}^{(l)}$ is only involved in the calculation of $\mathbf{z}_i^{(l)}$). Thus, we have

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{(l)}}\frac{\partial \mathbf{z}_i^{(l)}}{\partial \mathbf{W}^{(l)}_{i,j}}=\mathbf{\delta}_i^{(l)}\mathbf{a}_j^{(l-1)}$$

Now let’s assemble the result. We have $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}=[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}_{i,j}}]_{i,j}=[\mathbf{\delta}_i^{(l)}\mathbf{a}_j^{(l-1)}]_{i,j}=\mathbf{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$.

For the last layer, we have $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}_{i,j}}=(\mathbf{a}^{(L)}_i-\mathbf{y}_i)\mathbf{a}_j^{(L-1)}$, and $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}}=(\mathbf{a}^{(L)}-\mathbf{y})\mathbf{a}^{(L-1)\top}$.
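In code the assembled gradient is just an outer product, e.g. in NumPy (a sketch with arbitrary shapes):

import numpy as np

rng = np.random.default_rng(1)
delta = rng.normal(size=(5, 1))        # delta^(l),  shape [m^(l), 1]
a_prev = rng.normal(size=(4, 1))       # a^(l-1),    shape [m^(l-1), 1]

grad_W = delta @ a_prev.T              # dL/dW^(l) = delta^(l) a^(l-1)^T
assert grad_W.shape == (5, 4)          # same shape as W^(l)
assert np.isclose(grad_W[2, 3], delta[2, 0] * a_prev[3, 0])   # element-wise rule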

Back propagation through layers

Now the only problem left is to calculate $\mathbf{\delta}^{(l)}$ for all $l$. We can use the chain rule to calculate $\mathbf{\delta}^{(l)}$ for all $l$:

$$ \begin{align*} \mathbf{\delta}_j^{(l-1)} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_j^{(l-1)}} \\ &= \sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{z}_j^{(l-1)}} \\ &= \sum_{k}\frac{\partial \mathcal{L}}{\partial \mathbf{z}_k^{(l)}}\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{a}_j^{(l-1)}}\frac{\partial \mathbf{a}_j^{(l-1)}}{\partial \mathbf{z}_j^{(l-1)}} \\ \end{align*} $$

The second step is because $\mathbf{z}_j^{(l-1)}$ contributes to every $\mathbf{z}_k^{(l)}$ in the linear transformation. The third step is because $\mathbf{z}_j^{(l-1)}$ only contributes to $\mathbf{a}_j^{(l-1)}$ in the nonlinear transformation.

Since $\mathbf{a}_j^{(l-1)}=f(\mathbf{z}_j^{(l-1)})$, we have $\frac{\partial \mathbf{a}_j^{(l-1)}}{\partial \mathbf{z}_j^{(l-1)}}=f'(\mathbf{z}_j^{(l-1)})$. Since $\mathbf{z}^{(l)}=\mathbf{W}^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, from the matrix multiplication we have $\frac{\partial \mathbf{z}_k^{(l)}}{\partial \mathbf{a}_j^{(l-1)}}=\mathbf{W}^{(l)}_{k,j}$. Thus, we have:

$$\mathbf{\delta}_j^{(l-1)}=\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}f'(\mathbf{z}_j^{(l-1)})$$

Now let’s assemble the result. We have

$$ \begin{align*} \mathbf{\delta}^{(l-1)} &=[\mathbf{\delta}_j^{(l-1)}]_{j=1}^{m^{(l-1)}}=[\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}f'(\mathbf{z}_j^{(l-1)})]_{j=1}^{m^{(l-1)}} \\ &=[\sum_{k}\mathbf{\delta}_k^{(l)}\mathbf{W}^{(l)}_{k,j}]_{j=1}^{m^{(l-1)}}\odot[f'(\mathbf{z}_j^{(l-1)})]_{j=1}^{m^{(l-1)}} \\ &=[\mathbf{W}_{\star,j}^{(l)\top} \mathbf{\delta}^{(l)}]_{j=1}^{m^{(l-1)}}\odot f'(\mathbf{z}^{(l-1)}) \\ &=\mathbf{W}^{(l)\top}\mathbf{\delta}^{(l)}\odot f'(\mathbf{z}^{(l-1)}) \end{align*} $$
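This recursion can also be verified numerically on a small two-layer slice of the network (a sketch; the sigmoid activation, sizes, and random values are arbitrary, and $f'(\mathbf{z})=f(\mathbf{z})(1-f(\mathbf{z}))$ for the sigmoid):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
m, c = 4, 3
W2 = rng.normal(size=(c, m))           # W^(l), here the last layer
y = np.eye(c)[:, [1]]                  # one-hot label, shape [c, 1]
z1 = rng.normal(size=(m, 1))           # z^(l-1)

def loss_from_z1(z1):
    a1 = sigmoid(z1)
    return -np.sum(y * np.log(softmax(W2 @ a1)))

# analytic: delta^(l-1) = W^(l)^T delta^(l) (elementwise) f'(z^(l-1))
a1 = sigmoid(z1)
delta2 = softmax(W2 @ a1) - y
delta1 = (W2.T @ delta2) * (a1 * (1 - a1))

eps = 1e-6
numeric = np.array([[(loss_from_z1(z1 + eps * np.eye(m)[:, [j]]) -
                      loss_from_z1(z1 - eps * np.eye(m)[:, [j]])) / (2 * eps)]
                    for j in range(m)])
assert np.allclose(delta1, numeric, atol=1e-6)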

A batch of data

We can extend the above calculation to a batch of data. By convention, one example is represented as a column vector in the discussion above, while in deep learning programming each example is a row of the batch matrix. We have $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_b]^\top$ (shape $[b, n]$), where $\mathbf{x}_i$ is a column vector, and $\mathbf{Y}=[\mathbf{y}_1,\mathbf{y}_2,\cdots,\mathbf{y}_b]^\top$ (shape $[b, c]$), where $\mathbf{y}_i$ is a column vector.

Then the loss is $\mathcal{L}=\frac{1}{b}\sum_{i=1}^b\mathcal{L}(\mathbf{x}_i,\mathbf{y}_i)$, and the vectors $\mathbf{a}^{(l)},\mathbf{z}^{(l)},\mathbf{\delta}^{(l)}$ become matrices $\mathbf{A}^{(l)},\mathbf{Z}^{(l)},\mathbf{\Delta}^{(l)}$ respectively (all of shape $[b, m^{(l)}]$). For the linear transformation, we have

$$\mathbf{Z}^{(l)}=\mathbf{A}^{(l-1)}\mathbf{W}^{(l)\top}+\mathbf{B}^{(l)}$$

where $\mathbf{B}^{(l)}$ is formed by stacking $b$ copies of $\mathbf{b}^{(l)\top}$ as rows. For the nonlinear transformation, we have $\mathbf{A}^{(l)}=f(\mathbf{Z}^{(l)})$. (You can also define the $\mathbf{W}$ matrix to be the transpose of the original definition to avoid the transpose in the linear transformation.)
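For example, the batched linear layer and its shapes (a minimal sketch with arbitrary sizes; the bias is broadcast over rows, which plays the role of stacking $\mathbf{b}^{(l)}$ into $\mathbf{B}^{(l)}$):

import numpy as np

rng = np.random.default_rng(3)
b, m_prev, m = 8, 4, 5
A_prev = rng.normal(size=(b, m_prev))   # A^(l-1), one example per row
W = rng.normal(size=(m, m_prev))        # W^(l), same shape as in the vector case
bias = np.zeros(m)                      # b^(l)

Z = A_prev @ W.T + bias                 # broadcasting stands in for B^(l)
assert Z.shape == (b, m)                # [b, m^(l)]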

Note that since the single-example discussion above was carried out element by element, the derivation for the matrix case with a batch of data is similar.

Loss and softmax

For the loss and softmax, $\frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(L)}_{i,j}}=\frac{1}{b}(\mathbf{A}^{(L)}_{i,j}-\mathbf{Y}_{i,j})$. The assembly process is straightforward and leads to $\frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(L)}}=\frac{1}{b}(\mathbf{A}^{(L)}-\mathbf{Y})$.

Gradients for the weights

For the gradients of the weights, the linear transformation is equivalent to $\mathbf{Z}^{(l)\top}=\mathbf{W}^{(l)}\mathbf{A}^{(l-1)\top}+\mathbf{B}^{(l)\top}$, which is closer to the vector form. Since the weights are involved in the calculation for every example, using the multivariable chain rule, we have

$$\frac{\partial\mathcal{L}}{\partial \mathbf{W}_{i,j}^{(l)}}=\sum_{k=1}^{b}\frac{\partial \mathcal{L}}{\partial \mathbf{Z}_{i,k}^{(l)\top}}\frac{\partial \mathbf{Z}_{i,k}^{(l)\top}}{\partial \mathbf{W}^{(l)}_{i,j}}=\sum_{k=1}^b\mathbf{\Delta}_{i,k}^{(l)\top}\mathbf{A}_{j,k}^{(l-1)\top}=\sum_{k=1}^b\mathbf{\Delta}_{i,k}^{(l)\top}\mathbf{A}_{k,j}^{(l-1)}$$

After assembly, we have

$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{(l)}}=\mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}$$

A quick shape check verifies the result: $\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{(l)}}$ should have the same shape as $\mathbf{W}^{(l)}$, which is $[m^{(l)}, m^{(l-1)}]$. $\mathbf{\Delta}^{(l)\top}$ is of shape $[m^{(l)}, b]$ and $\mathbf{A}^{(l-1)}$ is of shape $[b, m^{(l-1)}]$, so $\mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)}$ is of shape $[m^{(l)}, m^{(l-1)}]$, the same as $\mathbf{W}^{(l)}$.

BP through layers

Finally, for the update from $\mathbf{\Delta}^{(l)}$ to $\mathbf{\Delta}^{(l-1)}$, since the examples are independent of each other, it is easy to see that $\mathbf{\Delta}^{(l-1)\top}=\mathbf{W}^{(l)\top}\mathbf{\Delta}^{(l)\top}\odot f'(\mathbf{Z}^{(l-1)\top})$, which means $\mathbf{\Delta}^{(l-1)}=\mathbf{\Delta}^{(l)}\mathbf{W}^{(l)}\odot f'(\mathbf{Z}^{(l-1)})$. This is a direct extension of the vector form.
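A quick way to convince yourself is to compare the batched formula with the per-example vector formula row by row (a sketch; the sigmoid derivative stands in for an arbitrary $f'$):

import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

rng = np.random.default_rng(5)
b, m_prev, m = 6, 4, 5
W = rng.normal(size=(m, m_prev))        # W^(l)
Z_prev = rng.normal(size=(b, m_prev))   # Z^(l-1), one example per row
Delta = rng.normal(size=(b, m))         # Delta^(l)

Delta_prev = Delta @ W * d_sigmoid(Z_prev)            # batched formula

# per-example vector formula: delta^(l-1) = W^(l)^T delta^(l) (elementwise) f'(z^(l-1))
for i in range(b):
    row = W.T @ Delta[i] * d_sigmoid(Z_prev[i])
    assert np.allclose(Delta_prev[i], row)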

Implementation

Now we have the derivation of the backpropagation algorithm:

$$\begin{align*} \mathbf{\Delta}^{(L)} & = \frac{1}{b}(\mathbf{A}^{(L)}-\mathbf{Y}) \\ \mathbf{\Delta}^{(l)} & = \mathbf{\Delta}^{(l+1)}\mathbf{W}^{(l+1)}\odot f'(\mathbf{Z}^{(l)}) \\ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} & = \mathbf{\Delta}^{(l)\top}\mathbf{A}^{(l-1)} \\ \end{align*}$$

Then we can implement the backward pass easily. The pseudocode for a multilayer perceptron without bias is as follows. Note that the $\mathbf{W}$ matrices strictly follow the definition above and are consistent with PyTorch nn.Linear.

# With PyTorch style API
# W is a list of weight matrices; W[l] has shape [m^(l+1), m^(l)],
# consistent with nn.Linear, so each linear layer is A @ W.T.
# f, d_f and softmax are the activation, its derivative, and the softmax.

def forward_pass(X, Y, W):
    n_layers = len(W)
    A = [X] + [None] * n_layers        # A[0] is the input A^(0)
    Z = [None] * (n_layers + 1)        # Z[0] is unused, to align indices with A
    for l in range(1, n_layers):
        Z[l] = A[l-1] @ W[l-1].T
        A[l] = f(Z[l])
    Z[n_layers] = A[n_layers-1] @ W[n_layers-1].T     # logits Z^(L)
    A[n_layers] = softmax(Z[n_layers])                # probabilities A^(L)
    loss = -np.sum(Y * np.log(A[n_layers])) / Y.shape[0]
    return loss, A, Z

def backward_pass(Y, A, Z, W):
    n_layers = len(W)
    Delta = [None] * (n_layers + 1)
    W_g = [None] * n_layers
    Delta[n_layers] = (A[n_layers] - Y) / Y.shape[0]  # Delta^(L)
    W_g[n_layers-1] = Delta[n_layers].T @ A[n_layers-1]
    for l in range(n_layers-1, 0, -1):
        Delta[l] = Delta[l+1] @ W[l] * d_f(Z[l])      # Delta^(l)
        W_g[l-1] = Delta[l].T @ A[l-1]                # dL/dW^(l)
    return W_g

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)    # analogous to PyTorch loss.backward()
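As a final check, the manual gradients can be compared against PyTorch autograd on a small two-layer network (a sketch assuming torch is available; the sigmoid activation, sizes, and random data are arbitrary, and the manual formulas below mirror the derivation rather than calling the pseudocode above):

import torch

torch.manual_seed(0)
b, n, m, c = 8, 4, 5, 3
X = torch.randn(b, n)
Y = torch.nn.functional.one_hot(torch.randint(0, c, (b,)), c).float()
W1 = torch.randn(m, n, requires_grad=True)   # W^(1)
W2 = torch.randn(c, m, requires_grad=True)   # W^(2)

# forward pass (no bias), same formulas as above
A1 = torch.sigmoid(X @ W1.T)
A2 = torch.softmax(A1 @ W2.T, dim=1)
loss = -(Y * torch.log(A2)).sum() / b
loss.backward()

# manual backward pass from the derivation
with torch.no_grad():
    Delta2 = (A2 - Y) / b
    Delta1 = Delta2 @ W2 * (A1 * (1 - A1))   # sigmoid'(Z1) = A1 * (1 - A1)
    W2_g = Delta2.T @ A1
    W1_g = Delta1.T @ X

assert torch.allclose(W2_g, W2.grad, atol=1e-5)
assert torch.allclose(W1_g, W1.grad, atol=1e-5)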