We were discussing universal approximation theorems for neural networks and showed that the triangular function
$$ h(x) = \begin{cases} x+1, & x \in [-1,0] \\ 1-x, & x \in [0,1] \\ 0, & \text{otherwise} \end{cases} $$
can be written using the ReLU function $p(x) = \max(0, x)$ as
$$ h(x) = p(x+1) - 2p(x) + p(x-1). $$
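As a quick numerical sanity check of this identity (a small sketch, assuming NumPy; the evaluation grid on $[-2,2]$ is an arbitrary choice of mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def hat(x):
    # The triangular function from the piecewise definition above:
    # x+1 on [-1,0], 1-x on [0,1], 0 elsewhere.
    return np.where((x >= -1) & (x <= 0), x + 1,
           np.where((x > 0) & (x <= 1), 1 - x, 0.0))

x = np.linspace(-2, 2, 1001)
lhs = hat(x)
rhs = relu(x + 1) - 2 * relu(x) + relu(x - 1)
print(np.max(np.abs(lhs - rhs)))  # ~0 up to floating-point error
```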
The next step was to prove that any continuous function $f:[0,1] \to \mathbb{R}$ can be uniformly approximated by a shallow neural network with ReLU activation. Our instructor just referred to Faber–Schauder approximations and did not give a detailed proof, so I am left a bit confused as to how we would go about showing this.
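To build some intuition, I also put together the following sketch: it constructs the piecewise-linear interpolant of $f$ on the uniform grid $k/n$, $k=0,\dots,n$, as a sum of scaled, shifted hat functions (i.e. a shallow ReLU network) and measures the sup-norm error. The test function $\cos(2\pi x)$ and the grid sizes are just my own choices.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def hat(x):
    # h(x) = p(x+1) - 2 p(x) + p(x-1), the triangular bump from above.
    return relu(x + 1) - 2 * relu(x) + relu(x - 1)

def pl_interpolant(f, n):
    """Piecewise-linear interpolant of f on [0,1] with nodes k/n, k = 0..n,
    written as a sum of scaled, shifted hat functions (a shallow ReLU net)."""
    nodes = np.arange(n + 1) / n
    coeffs = f(nodes)  # the interpolant matches f at the grid points
    return lambda x: sum(c * hat(n * x - k) for k, c in enumerate(coeffs))

f = lambda x: np.cos(2 * np.pi * x)   # arbitrary continuous test function
x = np.linspace(0, 1, 2001)
for n in (4, 16, 64):
    err = np.max(np.abs(f(x) - pl_interpolant(f, n)(x)))
    print(n, err)  # sup-norm error shrinks as the grid is refined
```

Numerically the error does go to zero as $n$ grows, but of course this is not a proof, which is what my questions below are about.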
My questions:
- Why can any piecewise linear (Faber–Schauder-type) approximation be represented as a sum of ReLU functions?
- Why then can any continuous function be uniformly approximated by such a Faber–Schauder approximation?
- Is the restriction to $[0,1]$ essential, or just a convention?
Any rigorous explanation or reference would be greatly appreciated.