
Motivation

Training neural networks, especially very deep ones, can be difficult if the weights aren’t initialized carefully: poor initialization can lead to vanishing gradients (the slopes shrink toward zero) or exploding gradients (the slopes grow extremely large).

Intuition

Imagine a DNN where the number of layers \( l \in [50, 200+] \) becomes very large. Ignoring the biases and activation functions, the prediction is then built from the product of many preceding weight matrices \( w^{[l]} \) and is conceptually similar to: \[ w^{[l]}w^{[l-1]}w^{[l-2]}w^{[l-3]} \dots w^{[3]}w^{[2]}w^{[1]}x_n = \hat{y} \]

Taking as an example a weight matrix that is a little larger than the identity matrix, \[ w^{[l]} = \begin{bmatrix} 1.6 & 0 \\ 0 & 1.6 \\ \end{bmatrix} \] we can see how \( \hat{y} \) would become very large and explode.

The opposite happens if the weights are initialized to relatively small values, such as: \[ w^{[l]} = \begin{bmatrix} 0.4 & 0 \\ 0 & 0.4 \\ \end{bmatrix} \] Here \( \hat{y} \) becomes very small and vanishes.
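To make this concrete, here is a minimal NumPy sketch (assuming, as in the intuition above, a purely linear network with no biases and the same 2×2 weight matrix at every layer; the depth of 50 is an arbitrary choice):

import numpy as np

x = np.array([1.0, 1.0])            # toy input
W_big = np.array([[1.6, 0.0],       # slightly larger than the identity
                  [0.0, 1.6]])
W_small = np.array([[0.4, 0.0],     # slightly smaller than the identity
                    [0.0, 0.4]])

L = 50                              # number of layers (arbitrary)
y_big, y_small = x.copy(), x.copy()
for _ in range(L):
    y_big = W_big @ y_big           # grows like 1.6**L  (~1.6e+10 for L = 50)
    y_small = W_small @ y_small     # shrinks like 0.4**L (~1.3e-20 for L = 50)

print(y_big, y_small)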

Weight Initialization

Picking the initial values carefully can partially solve this problem. Intuitively, when a layer receives input from a large number of nodes, each individual weight should be smaller, since more terms contribute to each unit in the next layer.
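To see where the \( \frac{1}{n} \)-style scaling below comes from, here is a one-line sketch (assuming zero-mean, independent weights and inputs, and ignoring the bias and activation): for a unit \( z = \sum_{i=1}^{n} w_i x_i \), \[ Var[z] = \sum_{i=1}^{n} Var[w_i]\,Var[x_i] = n\,Var[w_i]\,Var[x_i] \] so keeping \( Var[z] \) on the same scale as \( Var[x_i] \) requires \( Var[w_i] \approx \frac{1}{n} \).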

Zero Initialization

Don’t do this! Initializing all the weights to zero creates symmetry in the network: every unit in a layer computes the same output and receives the same gradient update, so the units never learn different features. Regardless of how many layers it has, the network then performs no better than a simple logistic regression.
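A minimal sketch of the symmetry problem (the one-hidden-layer sizes and the sigmoid activation here are arbitrary choices for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 3, 4                     # input and hidden layer sizes (arbitrary)
W1 = np.zeros((n_h, n_x))           # zero initialization
b1 = np.zeros((n_h, 1))

x = np.random.randn(n_x, 1)         # any input
a1 = sigmoid(W1 @ x + b1)
print(a1.ravel())                   # [0.5 0.5 0.5 0.5]: every unit is identical

Because the units start out identical, they also receive identical gradients, so they remain identical no matter how long you train.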

Sigmoid Initialization

Set the variance: \[ Var[w_i] = \frac{1}{n^{[l-1]}} \] where \( n^{[l-1]} \) is the number of units in the previous layer.

w[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1 / n[l-1])

ReLU / He Initialization

Set the variance: \[ Var[w_i] = \frac{2}{n^{[l-1]}} \]

w[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2 / n[l-1])
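As a slightly fuller (hypothetical) sketch, the same rule applied to every layer of a small ReLU network; the layer_dims list and the parameter dictionary layout are assumptions for illustration, not from the original note:

import numpy as np

def initialize_he(layer_dims, seed=0):
    # He initialization: each W[l] is drawn with variance 2 / n[l-1]
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params["W" + str(l)] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
        params["b" + str(l)] = np.zeros((n_curr, 1))
    return params

params = initialize_he([784, 256, 128, 10])   # e.g. an MNIST-sized ReLU network
print(params["W1"].std())                     # roughly sqrt(2/784) ≈ 0.05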

Xavier Initialization

For \( \tanh \), set the variance: \[ Var[w_i] = \frac{1}{n^{[l-1]}} \] i.e. scale the standard normal weights by \( \sqrt{\frac{1}{n^{[l-1]}}} \).
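Closing the loop on the intuition section, a quick check (under the same assumptions as before: a linear stack with no biases; the width of 256 and depth of 100 are arbitrary) that this scaling keeps the activations from exploding or vanishing:

import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 100                     # layer width and depth (arbitrary)
a = rng.standard_normal((n, 1))

for _ in range(L):
    W = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)   # Var[w] = 1/n
    a = W @ a

print(a.std())                      # stays on the order of 1, not ~1e+10 or ~1e-20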