For a simple skip connection $y = x@w + x$, the gradient $\partial y / \partial x$ will be $w + 1$:
$$\frac {\partial y}{\partial x} = w +1$$
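As a quick sanity check (my own sketch, assuming PyTorch), autograd agrees: in the scalar case the derivative is exactly $w + 1$, and in the matrix case the $+1$ shows up as an identity matrix added to $w^\top$:

```python
import torch

# Scalar case: y = x * w + x, so dy/dx should be w + 1.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.3)
y = x * w + x
y.backward()
print(x.grad)  # tensor(1.3000) == w + 1

# Matrix case: y = x @ w + x, the Jacobian dy/dx is w.T + I,
# so the "+1" is really an added identity matrix.
D = 4
x = torch.randn(D, requires_grad=True)
w = torch.randn(D, D)
jac = torch.autograd.functional.jacobian(lambda v: v @ w + v, x)
print(torch.allclose(jac, w.T + torch.eye(D)))  # True
```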
Isn't the $+1$ a bit too large, and can it overpower $w$ during backpropagation? Why not use $y = x@w + \lambda x$, so that the gradient becomes $w + \lambda$?
I ask because I think the majority of the values in $w$ lie in $(-1, 1)$ (about 68% already for a standard normal, and essentially all of them after scaling), since $w$ is initialized from a normal or uniform distribution and scaled by $\frac{1}{\sqrt{D}}$, where $D$ is the dimension (number of features) of $x$, e.g. 512. Without that scaling, $D$ would also be the variance of $y$.
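Here is a rough numerical check of that (my own sketch, using $D = 512$ just as an example): for an unscaled standard normal about 68% of the entries land in $(-1, 1)$, and after the $1/\sqrt{D}$ scaling essentially all of them do, so the entries of $w$ are much smaller than the $+1$ on the skip path:

```python
import torch

D = 512
w_raw = torch.randn(D, D)        # standard normal, no scaling
w = w_raw / D ** 0.5             # scaled by 1/sqrt(D), std ~ 0.044

def frac_inside(t):
    # fraction of entries strictly inside (-1, 1)
    return ((t > -1) & (t < 1)).float().mean().item()

print(frac_inside(w_raw))        # ~0.68 (one standard deviation)
print(frac_inside(w))            # ~1.0

# Variance of x @ w for unit-variance inputs:
x = torch.randn(10_000, D)
print((x @ w_raw).var().item())  # ~ D = 512 without scaling
print((x @ w).var().item())      # ~ 1 with the 1/sqrt(D) scaling
```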
Could $\lambda$ also be a learnable parameter, so that the influence of the skip connection can be reduced in later training epochs?
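Something like the following is what I have in mind (a minimal sketch; `ScaledSkip` and `init_lambda` are just names I made up, not from any library):

```python
import torch
import torch.nn as nn

class ScaledSkip(nn.Module):
    """y = linear(x) + lambda * x, with lambda a learnable scalar."""

    def __init__(self, dim: int, init_lambda: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)
        # Initializing at 1.0 recovers the usual skip connection y = x @ w + x;
        # the optimizer is then free to shrink (or grow) the skip path.
        self.lam = nn.Parameter(torch.tensor(init_lambda))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.lam * x

block = ScaledSkip(dim=512)
y = block(torch.randn(8, 512))
print(block.lam)  # updated by the optimizer along with the other weights
```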