Initial weight selection


The initial values of the network weights can have a significant impact on their optimization via learning algorithms in which the error is backpropagated. Considering a symmetric sigmoid, for example (the extension to non-symmetric sigmoids is straightforward), the error surface is very flat near the origin; hence, very small weights will result in small gradients. Far from the origin, the sigmoid saturates and its derivative goes to zero (i.e., the error surface is again flat).

Theory

For a symmetric sigmoid, it is easy to see that the initial weights should be chosen from a distribution with zero mean. In this case, the expected total input to a node is \begin{equation} \tag{1} \operatorname{E} \left[\sum_{i=1}^n w_i x_i \right] = 0 \end{equation} which is desirable, because:

  • The derivative of a symmetric sigmoid attains its maximum value at 0, so the error signal backpropagated through the node is largest there (see the example below this list)
  • The network learns the linear part of the mapping before the non-linear part
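
For example, taking the hyperbolic tangent as the symmetric sigmoid, \[ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) \] which attains its maximum value of 1 at \(x = 0\) and decays towards zero as \(|x|\) grows, i.e., as the node saturates.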

The question then becomes: what distribution should the weights be chosen from?

Consider the variance of the total input to a node: \[ \begin{align} \operatorname{Var} \left( \sum_{i=1}^n w_i x_i \right) &= \operatorname{E}\left[\left(\sum_{i=1}^n w_i x_i\right)^2 \right] - \left( \operatorname{E} \left[\sum_{i=1}^n w_i x_i \right] \right)^2 \\ &=\operatorname{E}\left[\left(\sum_{i=1}^n w_i x_i\right)^2 \right] \end{align} \] where, in the second line, the zero-mean property from Eq. (1) has been used. Further, since the weights are independent of the inputs, \begin{equation} \tag{2} \operatorname{Var}\left(\sum_{i=1}^n w_i x_i\right) = \sum_{i=1}^n \operatorname{E} \left[ w_i^2 \right] \operatorname{E} \left[ x_i^2 \right] \end{equation}
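
To see Eq. (2), note that when the square is expanded every cross term vanishes: for \(i \neq j\), assuming the weights are drawn independently of one another and of the inputs, \[ \operatorname{E}\left[ w_i x_i w_j x_j \right] = \operatorname{E}\left[ w_i \right] \operatorname{E}\left[ w_j \right] \operatorname{E}\left[ x_i x_j \right] = 0 \] because the weights have zero mean. Only the diagonal terms \(\operatorname{E}\left[ w_i^2 x_i^2 \right] = \operatorname{E}\left[ w_i^2 \right] \operatorname{E}\left[ x_i^2 \right]\) survive.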

The optimal weight selection now depends on three factors:

  1. The variance of the total input to a node
  2. The distribution of the input $X$
  3. The distribution that we wish to draw \(w_i\) from

On the one hand, the variance of the total input to a node should be made large: the greater the variance, the greater the chance that each hidden node has of separating from its neighbors (i.e., of recognizing different features of the input) during training. On the other hand, a wider variance results in a lower expected value of the sigmoid's derivative. In the end, the optimization of the weights turns out not to be very sensitive to this choice[1]. A convenient choice is $\operatorname{Var} \left( \sum_{i=1}^n w_i x_i \right) = 1$, because for a sigmoid with an effective gain of 1 over its useful range, the variance of the output will be approximately the same as that of the inputs[2]. This is useful when there are multiple layers of hidden nodes.

The distribution of the input $X$ is handled during preprocessing. For inputs that have been centred and normalized to unit variance, $\operatorname{E} \left[ x_i^2 \right] = 1$.
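
As a rough illustration of that preprocessing step (a plain per-component standardization, which gives \(\operatorname{E}[x_i^2] = 1\) but does not decorrelate the components; the function below is a hypothetical sketch, not the analytics++ routine):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Centre each input component and scale it to unit variance, so that
    // (approximately) E[x_i] = 0 and E[x_i^2] = 1 over the sample.
    // data[k][i] holds component i of training sample k.
    void standardize(std::vector<std::vector<double>>& data)
    {
        if (data.empty()) return;
        const std::size_t n = data.front().size();          // number of input components
        const double m = static_cast<double>(data.size());  // number of samples
        for (std::size_t i = 0; i < n; ++i) {
            double mean = 0.0, var = 0.0;
            for (const auto& x : data) mean += x[i];
            mean /= m;
            for (const auto& x : data) var += (x[i] - mean) * (x[i] - mean);
            var /= m;
            const double sd = std::sqrt(var);
            for (auto& x : data) x[i] = (sd > 0.0) ? (x[i] - mean) / sd : 0.0;
        }
    }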

Based on the above, and taking the weights to be identically distributed, Eq. (2) simplifies to \begin{equation} \operatorname{E} \left[ w_i^2 \right] = \frac{1}{n} \end{equation} and we are left to choose the distribution from which to draw each $w_i$. For uniform and normal distributions: \begin{equation} \operatorname{E} \left[ w_i^2 \right] = \begin{cases} \frac{1}{3} \alpha^2 & \mathrm{if}\ w_i \sim \mathcal{U}\left( -\alpha, \alpha \right) \\ \sigma^2 & \mathrm{if}\ w_i \sim \mathcal{N}\left( 0, \sigma^2 \right) \end{cases} \end{equation} Therefore: \begin{equation} \begin{aligned} \alpha &= \sqrt{\frac{3}{n}} && \mathrm{if}\ w_i \sim \mathcal{U}\left( -\alpha, \alpha \right) \\ \sigma^2 &= \frac{1}{n} && \mathrm{if}\ w_i \sim \mathcal{N}\left( 0, \sigma^2 \right) \end{aligned} \end{equation}
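
For concreteness, a minimal sketch of these two initializations in C++, using the standard <random> facilities (the function name and interface are hypothetical, not those of analytics++):

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw the n fan-in weights of a single node so that E[w_i^2] = 1/n:
    // either uniformly on (-alpha, alpha) with alpha = sqrt(3/n), or
    // normally with zero mean and variance 1/n.
    std::vector<double> initial_weights(std::size_t n, bool uniform, std::mt19937& rng)
    {
        std::vector<double> w(n);
        if (uniform) {
            const double alpha = std::sqrt(3.0 / n);
            std::uniform_real_distribution<double> dist(-alpha, alpha);
            for (auto& wi : w) wi = dist(rng);
        } else {
            const double sigma = std::sqrt(1.0 / n);  // normal_distribution takes the standard deviation
            std::normal_distribution<double> dist(0.0, sigma);
            for (auto& wi : w) wi = dist(rng);
        }
        return w;
    }

For a node with \(n = 100\) inputs, for example, the uniform case draws each weight from \(\mathcal{U}\left(-\sqrt{0.03}, \sqrt{0.03}\right)\), i.e., roughly from \((-0.17, 0.17)\).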

analytics++

Based on the above discussion, analytics++ initializes network weights by drawing them from the optimum distributions derived above.

Currently these settings cannot be overridden.

References and notes

  1. R. Rojas, Neural Networks: A Systematic Introduction (Springer-Verlag: Berlin, 1996)
  2. Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp", in: G. B. Orr and K.-R. Müller (eds), Neural Networks: Tricks of the Trade (Springer-Verlag: Berlin, 1998)