I am currently preparing for an exam on neural networks. In several write-ups of former exams I read that the activation functions of neurons (in multilayer perceptrons) have to be monotonic.
I understand that activation functions should be differentiable, have a derivative that is nonzero at most points, and be non-linear. What I do not understand is why being monotonic is important or helpful.
I know the following activation functions, and they are all monotonic:
- ReLU
- Sigmoid
- Tanh
- Softmax: I'm not sure whether the definition of monotonicity is even applicable to functions $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $n, m > 1$
- Softplus
- (Identity)
However, I still can't see any reason why, for example, $\varphi(x) = x^2$ (which is not monotonic) should not work as an activation function; see the small sketch below.
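For concreteness, here is a minimal sketch (assuming NumPy; softmax is omitted because it is vector-valued) of the functions listed above together with $\varphi(x) = x^2$, checking monotonicity numerically via the sign of finite-difference slopes on a grid:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)

activations = {
    "relu":     np.maximum(0.0, x),
    "sigmoid":  1.0 / (1.0 + np.exp(-x)),
    "tanh":     np.tanh(x),
    "softplus": np.log1p(np.exp(x)),
    "identity": x,
    "square":   x ** 2,   # phi(x) = x^2, the non-monotonic candidate in question
}

for name, y in activations.items():
    slopes = np.diff(y) / np.diff(x)                      # finite-difference derivative
    monotone = np.all(slopes >= 0) or np.all(slopes <= 0)  # slope never changes sign
    print(f"{name:9s} monotonic on [-5, 5]: {monotone}")
```

On this grid, every function except `square` comes out monotonic, which is exactly what puzzles me: the squaring function is differentiable and non-linear, yet apparently disqualified.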
Why do activation functions have to be monotonic?
(Related side question: is there any reason why the logarithm or the exponential function is not used as an activation function?)