It seems standard in many neural network packages to pair up the objective function to be minimised with the activation function in the output layer.

For instance, for a linear output layer used for regression it is standard (and often the only choice) to have a squared error objective function. Another usual pairing is logistic output with log loss (or cross-entropy), and yet another is softmax with multiclass log loss.

Using the notation $z$ for the pre-activation value (the sum of weights times activations from the previous layer), $a$ for the activation, $y$ for the ground truth used in training, and $i$ for the index of an output neuron (a short code sketch after the list makes these pairings concrete):

  • Linear activation $a_i = z_i$ goes with squared error $\frac{1}{2} \sum_i (y_i - a_i)^2$

  • Sigmoid activation $a_i = \frac{1}{1+e^{-z_i}}$ goes with log loss/cross-entropy objective $-\sum_i \left( y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right)$

  • Softmax activation $a_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ goes with multiclass log loss objective $-\sum_i y_i \log(a_i)$
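
To make these pairings concrete, here is a minimal NumPy sketch (my own illustration, not taken from any particular package) that evaluates each activation and its usual objective for a single example; the values of `z` and the targets are made up purely for demonstration:

```python
import numpy as np

# Hypothetical pre-activations for three output neurons
z = np.array([0.5, -1.2, 2.0])

# Linear output + squared error (regression targets can be any real values)
y_reg = np.array([0.7, -1.0, 1.5])
a_lin = z                                  # identity activation
mse = 0.5 * np.sum((y_reg - a_lin) ** 2)

# Sigmoid output + log loss / cross-entropy (targets in {0, 1})
y_bin = np.array([1.0, 0.0, 1.0])
a_sig = 1.0 / (1.0 + np.exp(-z))           # element-wise logistic
bce = -np.sum(y_bin * np.log(a_sig) + (1 - y_bin) * np.log(1 - a_sig))

# Softmax output + multiclass log loss (one-hot target over the 3 outputs)
y_cat = np.array([0.0, 0.0, 1.0])
a_soft = np.exp(z) / np.sum(np.exp(z))     # softmax over the output vector
ce = -np.sum(y_cat * np.log(a_soft))

print(mse, bce, ce)
```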

Those are the ones I know, and I expect there are many that I still haven't heard of.

It seems that log loss would only work and be numerically stable when the outputs and targets are in the range $[0,1]$. So it may not make sense to try a linear output layer with a log loss objective function, unless there is a more general log loss function that can cope with values of $y$ outside that range?
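
As a quick sanity check of that stability concern (again a toy example of my own), feeding an unbounded linear output straight into the binary log loss produces NaN as soon as a value falls outside $(0, 1)$, because the logarithm of a non-positive number appears; squashing the same values through a sigmoid keeps every term finite:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])
a_linear = np.array([1.3, -0.2, 0.6])      # linear outputs, not confined to (0, 1)

with np.errstate(invalid="ignore"):
    bad = -np.sum(y * np.log(a_linear) + (1 - y) * np.log(1 - a_linear))
print(bad)                                 # nan: log(1 - 1.3) and log(-0.2) are undefined

a_sig = 1.0 / (1.0 + np.exp(-a_linear))    # same values pushed through a sigmoid
good = -np.sum(y * np.log(a_sig) + (1 - y) * np.log(1 - a_sig))
print(good)                                # finite
```

In practice many libraries sidestep this by computing the cross-entropy directly from $z$ in a numerically stable form, which again couples the objective to a particular output activation.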

However, it doesn't seem quite so bad to try a sigmoid output with a squared error objective. It should at least be stable and converge.

I understand that part of the design behind these pairings is that they make the formula for $\frac{\partial E}{\partial z}$, where $E$ is the value of the objective function, easy to use in backpropagation. But it should still be possible to find that derivative for other pairings. Also, there are many other activation functions that are not commonly seen in output layers but feasibly could be, such as tanh, where it is not clear what objective function should be paired with them.
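
For reference, here is the kind of simplification I mean, worked through for a single sigmoid output using the notation above (these are standard results, restated here). The canonical pairing collapses to a very simple gradient, while the "non-standard" sigmoid plus squared error pairing is still perfectly computable, just with an extra factor:

$$
\frac{\partial a}{\partial z} = a(1-a), \qquad
\underbrace{\frac{\partial E}{\partial z} = a - y}_{\text{sigmoid + cross-entropy}}, \qquad
\underbrace{\frac{\partial E}{\partial z} = (a - y)\,a(1-a)}_{\text{sigmoid + squared error}}
$$

The extra $a(1-a)$ factor is bounded, which fits the observation above that the squared error pairing should at least be stable, although it shrinks the gradient whenever the output saturates near 0 or 1.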

Are there any situations, when designing the architecture of a neural network, where you would or should use "non-standard" pairings of output activation and objective function?
