For a simple skip connection $y = x@w + x$, the gradient $\partial y / \partial x$ will be $w + 1$:
$$\frac {\partial y}{\partial x} = w +1$$
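As a quick sanity check (my own sketch, assuming PyTorch), autograd agrees: in the scalar case the derivative is exactly $w + 1$, and in the matrix case the $+1$ shows up as an identity matrix added to $w^\top$:

```python
import torch

# Scalar case: y = x * w + x, so dy/dx should be w + 1.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.3)
y = x * w + x
y.backward()
print(x.grad)  # tensor(1.3000) == w + 1

# Matrix case: y = x @ w + x, the Jacobian dy/dx is w.T + I,
# so the "+1" is really an added identity matrix.
D = 4
x = torch.randn(D, requires_grad=True)
w = torch.randn(D, D)
jac = torch.autograd.functional.jacobian(lambda v: v @ w + v, x)
print(torch.allclose(jac, w.T + torch.eye(D)))  # True
```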
Isn't the $+1$ a bit too large, and can it overpower $w$ during backpropagation? Why not use $y = x@w + \lambda x$, so that the gradient becomes $w + \lambda$?
I ask because I think the majority of the values in $w$ lie in $(-1, 1)$ (about 68% already for a standard normal, and essentially all of them after scaling), since $w$ is initialized from a normal or uniform distribution and scaled by $\frac{1}{\sqrt{D}}$, where $D$ is the dimension (number of features) of $x$, e.g. 512. Without that scaling, $D$ would also be the variance of $y$.
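Here is a rough numerical check of that (my own sketch, using $D = 512$ just as an example): for an unscaled standard normal about 68% of the entries land in $(-1, 1)$, and after the $1/\sqrt{D}$ scaling essentially all of them do, so the entries of $w$ are much smaller than the $+1$ on the skip path:

```python
import torch

D = 512
w_raw = torch.randn(D, D)        # standard normal, no scaling
w = w_raw / D ** 0.5             # scaled by 1/sqrt(D), std ~ 0.044

def frac_inside(t):
    # fraction of entries strictly inside (-1, 1)
    return ((t > -1) & (t < 1)).float().mean().item()

print(frac_inside(w_raw))        # ~0.68 (one standard deviation)
print(frac_inside(w))            # ~1.0

# Variance of x @ w for unit-variance inputs:
x = torch.randn(10_000, D)
print((x @ w_raw).var().item())  # ~ D = 512 without scaling
print((x @ w).var().item())      # ~ 1 with the 1/sqrt(D) scaling
```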
Could $\lambda$ also be a learnable parameter, so that the influence of the skip connection can be reduced in later training epochs?
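Something like the following is what I have in mind (a minimal sketch; `ScaledSkip` and `init_lambda` are just names I made up, not from any library):

```python
import torch
import torch.nn as nn

class ScaledSkip(nn.Module):
    """y = linear(x) + lambda * x, with lambda a learnable scalar."""

    def __init__(self, dim: int, init_lambda: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)
        # Initializing at 1.0 recovers the usual skip connection y = x @ w + x;
        # the optimizer is then free to shrink (or grow) the skip path.
        self.lam = nn.Parameter(torch.tensor(init_lambda))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.lam * x

block = ScaledSkip(dim=512)
y = block(torch.randn(8, 512))
print(block.lam)  # updated by the optimizer along with the other weights
```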