# L-layer Neural Network

## Intro

This quick reference shows the layout of a neural network to help keep the math consistent.

## Initialization Parameters

Assume $$n^{[l]}$$ is the number of units in layer $$l$$ and $$L$$ is the total number of layers. For a given input $$\mathbf{X} \in \mathbb{R}^{12288 \times 209}$$ with $$m = 209$$ training examples, the parameter shapes are:

### Math

| | Shape of W | Shape of b | Activation | Shape of Activation |
|---|---|---|---|---|
| Layer 1 | $(n^{[1]}, 12288)$ | $(n^{[1]}, 1)$ | $Z^{[1]} = W^{[1]} X + b^{[1]}$ | $(n^{[1]}, 209)$ |
| Layer 2 | $(n^{[2]}, n^{[1]})$ | $(n^{[2]}, 1)$ | $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ | $(n^{[2]}, 209)$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| Layer L-1 | $(n^{[L-1]}, n^{[L-2]})$ | $(n^{[L-1]}, 1)$ | $Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ | $(n^{[L-1]}, 209)$ |
| Layer L | $(n^{[L]}, n^{[L-1]})$ | $(n^{[L]}, 1)$ | $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$ | $(n^{[L]}, 209)$ |

### Pseudocode

    import numpy as np

    parameters = {}
    L = len(layer_dims)            # number of layers in the network, including the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01  # small random weights
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))                            # zero biases
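As a quick sanity check, the loop above can be wrapped in a function and run against a small hypothetical layer layout (the sizes below are made up for illustration) to confirm the shapes match the table:

```python
import numpy as np

def initialize_parameters(layer_dims):
    """Initialize W as small random values and b as zeros for each layer."""
    parameters = {}
    L = len(layer_dims)  # includes the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# hypothetical: 12288 inputs, two hidden layers, one output unit
params = initialize_parameters([12288, 20, 7, 1])
assert params['W1'].shape == (20, 12288)
assert params['W3'].shape == (1, 7)
assert params['b2'].shape == (7, 1)
```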


## Activation

We assume the sigmoid activation $\sigma$ in this example, but any activation function can be substituted. Some ideas for activation functions are listed
in this post.

$A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})$

keeping in mind that for the last layer the notation is often

$A^{[L]} = \hat{Y}$
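A minimal sketch of one layer's forward step, assuming the sigmoid activation above (the function names are illustrative, not from a library):

```python
import numpy as np

def sigmoid(Z):
    """Elementwise sigmoid activation."""
    return 1 / (1 + np.exp(-Z))

def linear_activation_forward(A_prev, W, b):
    """Compute Z = W A_prev + b, then A = sigma(Z). Z is returned for backprop."""
    Z = W @ A_prev + b
    A = sigmoid(Z)
    return A, Z

# shapes follow the table: W is (n_l, n_{l-1}), A_prev is (n_{l-1}, m)
A_prev = np.random.randn(4, 3)
W = np.random.randn(2, 4)
b = np.zeros((2, 1))
A, Z = linear_activation_forward(A_prev, W, b)
assert A.shape == (2, 3)
```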

## Cost

This is the cross-entropy cost $J$; other cost functions are available:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)}) \log\left(1- a^{[L] (i)}\right)\right)$
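A vectorized sketch of this cost, assuming `AL` holds $a^{[L](i)}$ for all $m$ examples and `Y` holds the labels (both of shape $(1, m)$; the names are illustrative):

```python
import numpy as np

def compute_cost(AL, Y):
    """Cross-entropy cost averaged over the m training examples."""
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

Y = np.array([[1, 0, 1]])
AL = np.array([[0.9, 0.2, 0.8]])
cost = compute_cost(AL, Y)   # small positive cost, since predictions are good
```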

## Backprop

Backprop can be error-prone to implement. Use numeric gradient checking to verify your implementation, but use only the analytic gradients during training for performance.
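The numeric check can be sketched with a centered difference. Here it is shown against a hypothetical scalar-parameter cost; a real check loops over every entry of each $W^{[l]}$ and $b^{[l]}$ and compares against the analytic gradients below:

```python
def numeric_grad(J, theta, eps=1e-7):
    """Centered-difference approximation of dJ/dtheta."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

# hypothetical cost J(theta) = theta**2, with analytic gradient 2*theta
J = lambda theta: theta ** 2
theta = 3.0
approx = numeric_grad(J, theta)
analytic = 2 * theta

# relative error should be tiny if the analytic gradient is correct
assert abs(approx - analytic) / max(abs(approx), abs(analytic)) < 1e-6
```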

The analytic gradients for layer $l$ are:

$dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$

$db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l] (i)}$

$dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$
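These three formulas translate directly to numpy. A sketch, assuming $dZ^{[l]}$ has already been computed for the layer (function and variable names are illustrative):

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    """Given dZ^[l], return dW^[l], db^[l], and dA^[l-1]."""
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m                      # (1/m) dZ A_prev^T
    db = np.sum(dZ, axis=1, keepdims=True) / m    # (1/m) sum over examples
    dA_prev = W.T @ dZ                            # W^T dZ
    return dW, db, dA_prev

dZ = np.random.randn(2, 5)
A_prev = np.random.randn(4, 5)
W = np.random.randn(2, 4)
dW, db, dA_prev = linear_backward(dZ, A_prev, W)
assert dW.shape == W.shape and dA_prev.shape == A_prev.shape
```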

## Outro

Once backprop passes gradient checking, it’s relatively smooth sailing through the parameter updates, so the rest is omitted for brevity.
