
Intro

This quick reference shows the layer-by-layer shapes and equations of a deep neural network to help keep the math consistent.

Initialization Parameters

Assume \( n^{[l]} \) is the number of units in layer \( l \) and \( L \) is the total number of layers. For a given input \( \mathbf{X} \in \mathbb{R}^{12288 \times 209} \) with \( m = 209 \) training examples, the parameter and activation shapes are:

Math

| | Shape of W | Shape of b | Activation | Shape of Activation |
|---|---|---|---|---|
| Layer 1 | $(n^{[1]}, 12288)$ | $(n^{[1]}, 1)$ | $Z^{[1]} = W^{[1]} X + b^{[1]}$ | $(n^{[1]}, 209)$ |
| Layer 2 | $(n^{[2]}, n^{[1]})$ | $(n^{[2]}, 1)$ | $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ | $(n^{[2]}, 209)$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| Layer L-1 | $(n^{[L-1]}, n^{[L-2]})$ | $(n^{[L-1]}, 1)$ | $Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ | $(n^{[L-1]}, 209)$ |
| Layer L | $(n^{[L]}, n^{[L-1]})$ | $(n^{[L]}, 1)$ | $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$ | $(n^{[L]}, 209)$ |
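
To make the shapes concrete, here is a minimal numpy sketch of the layer 1 row of the table, assuming a hypothetical \( n^{[1]} = 20 \):

    import numpy as np

    n1, m = 20, 209                        # hypothetical layer-1 size; m training examples
    X = np.random.randn(12288, m)          # input X, shape (12288, 209)
    W1 = np.random.randn(n1, 12288) * 0.01
    b1 = np.zeros((n1, 1))
    Z1 = W1 @ X + b1                       # b1 broadcasts across the m columns
    print(Z1.shape)                        # (20, 209), matching the table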

Pseudocode

    import numpy as np

    parameters = {}
    L = len(layer_dims)            # layer_dims = [n_x, n_1, ..., n_L], so L counts the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
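
As a usage sketch, with `layer_dims` defined before the loop (the four-hidden-layer architecture below is just an assumed example), the resulting dictionary can be checked against the shapes in the table above:

    layer_dims = [12288, 20, 7, 5, 1]      # hypothetical architecture: n_x, n_1, ..., n_L
    # ... run the loop above ...
    print(parameters['W1'].shape)          # (20, 12288)
    print(parameters['b3'].shape)          # (5, 1)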

Activation

We assume the sigmoid $\sigma$ for this example, but any activation function can be substituted. Some ideas for activation functions are listed
in this post.

\[ A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})\]

keeping in mind that for the last layer the notation is often

\[ A^{[L]} = \hat{Y} \]
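
A minimal sketch of one forward step, assuming sigmoid at every layer (the function names here are just placeholders):

    import numpy as np

    def sigmoid(Z):
        return 1.0 / (1.0 + np.exp(-Z))     # elementwise sigmoid

    def layer_forward(A_prev, W, b):
        Z = W @ A_prev + b                  # linear part; b broadcasts over the m examples
        A = sigmoid(Z)                      # activation part
        return A, Z                         # keep Z cached for backprop

Chaining `layer_forward` over \( l = 1, \dots, L \) with \( A^{[0]} = X \) produces \( A^{[L]} = \hat{Y} \).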

Cost

This is the cross-entropy cost $J$, but other cost functions are available: \[ J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)}) \log\left(1- a^{[L](i)}\right) \right) \]
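
A short numpy sketch of this cost, assuming binary labels so that \( A^{[L]} \) and \( Y \) both have shape \( (1, m) \):

    import numpy as np

    def cross_entropy_cost(AL, Y):
        m = Y.shape[1]
        cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
        return float(np.squeeze(cost))      # scalar cost J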

Backprop

Backprop can be error-prone to implement. Use numerical gradient checking to verify the implementation, but use only the analytic gradients during training for performance.
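
A rough sketch of that check, assuming the parameters have been flattened into a single vector `theta` and `cost_fn(theta)` recomputes the cost (both names are hypothetical):

    import numpy as np

    def grad_check_entry(cost_fn, theta, analytic_grad, i, eps=1e-7):
        # centered-difference approximation of dJ/dtheta_i
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        numeric = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
        # relative difference; should be tiny (e.g. < 1e-7) if backprop is correct
        return abs(numeric - analytic_grad[i]) / max(abs(numeric) + abs(analytic_grad[i]), 1e-12)

The analytic gradients themselves, per layer, are: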

\[ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \] \[ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l] (i)}\] \[ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \]

where \( dZ^{[l]} = dA^{[l]} \ast g^{[l]\prime}(Z^{[l]}) \) for the layer's activation function \( g^{[l]} \).
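
These translate almost directly into numpy; a minimal sketch for one layer, given \( dZ^{[l]} \) (function and variable names are just placeholders):

    import numpy as np

    def linear_backward(dZ, A_prev, W):
        m = A_prev.shape[1]
        dW = (dZ @ A_prev.T) / m                      # dW^[l] = (1/m) dZ^[l] A^[l-1]T
        db = np.sum(dZ, axis=1, keepdims=True) / m    # db^[l] = (1/m) sum_i dZ^[l](i)
        dA_prev = W.T @ dZ                            # dA^[l-1] = W^[l]T dZ^[l]
        return dW, db, dA_prev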

Outro

Once backprop is working, it's relatively smooth sailing through the parameter updates, so the rest is omitted for brevity.