I am reading Chapter 5 of PRML. Some of the notation is not clear to me.

On page 243, for the chain rule for the partial derivative $\dfrac{\partial E_n}{\partial w_{ji}}=\dfrac{\partial E_n}{\partial a_j}\dfrac{\partial a_j}{\partial w_{ji}}$ (equation (5.50)), the notation $\delta_j\equiv\dfrac{\partial E_n}{\partial a_j}$ is introduced (equation (5.51)). In my understanding, $\delta_j$ is the first factor in the chain rule. However, for equation (5.54) the book states

As we have seen already, for the output units, we have $$\delta_k=y_k-t_k$$

Question 1: $y_k-t_k$ is the error on output unit $k$, which is simply the difference between the $k$th output unit value and the corresponding target value. But, from the definition of the notation $\delta_k$, we should have $$\delta_k=\dfrac{\partial \frac{1}{2}(y_k-t_k)^2}{\partial a_k}=(y_k-t_k)\dfrac{\partial y_k}{\partial a_k}=(y_k-t_k)\dfrac{\partial h(a_k)}{\partial a_k}$$ where $h(a_k)$ is the activation function. So why does the book give $\delta_k=y_k-t_k$?

On page 242, Section 5.3 (Error Backpropagation), the book says:

Consider a simple linear model where the outputs $y_k$ are linear combinations of the input variables $x_i$ so that $y_k=\sum_iw_{ki}x_i$. For a particular input pattern $n$, the error function is $E_n=\dfrac{1}{2}\sum_k(y_{nk}-t_{nk})^2$, where $y_{nk}=y_k(\boldsymbol{x_n},\boldsymbol{w})$. So the gradient of this error function with respect to a weight $w_{ji}$ is given by $$\frac{\partial E_n}{\partial w_{ji}}=(y_{nj}-t_{nj})x_{ni}$$ which can be interpreted as a ‘local’ computation involving the product of an ‘error signal’ $y_{nj}-t_{nj}$ associated with the output end of the link $w_{ji}$ and the variable $x_{ni}$ associated with the input end of the link.
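(For completeness, my own working rather than the book's: since $y_{nj}=\sum_i w_{ji}x_{ni}$, the chain rule gives this result directly, $$\frac{\partial E_n}{\partial w_{ji}}=\frac{\partial E_n}{\partial y_{nj}}\,\frac{\partial y_{nj}}{\partial w_{ji}}=(y_{nj}-t_{nj})\,x_{ni}.)$$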

Question 2: I am not clear on the structure of this neural network. Is the one in the book a two-layer neural network with a linear activation?


1 Answer


As you point out, with $\delta_k = y_k - t_k$ the author is stating the relationship between the final units' outputs and the targets. So equation (5.54) is simply saying:

the error on the $k^{th}$ output unit is the difference between its output and the target.

I believe that $\delta_k$ refers to the error at an output unit, whereas $\delta_j$ is the derivative of the error $E_n$ with respect to the pre-activation $a_j$ of any unit $j$ further back in the network. If this is the case, doing your derivative as you did for $\delta_k$ (instead of $\delta_j$) means you are computing the error gradient at the output layer. Over this layer the activation is taken to be linear (we do not apply a non-linearity to the final layer's output), so your $\dfrac{\partial h(a_k)}{\partial a_k}$ term reduces to $1$ and you recover the author's result: $\delta_k = y_k - t_k$.
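Concretely, if the output-unit activation is the identity, $y_k = a_k$ (as PRML assumes for sum-of-squares regression), then the missing factor is exactly one:
$$\delta_k \equiv \frac{\partial E_n}{\partial a_k} = (y_k - t_k)\,\frac{\partial y_k}{\partial a_k} = (y_k - t_k)\cdot 1 = y_k - t_k.$$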


The network considered in your second question is just a single fully-connected layer of weights, so yes: two layers of neurons, input and output only, with no non-linearity (i.e. a linear activation, as you mention).
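As a rough illustration (not from the book; the sizes and variable names are my own), here is a minimal NumPy sketch of that 'local' gradient computation for one input pattern:

```python
import numpy as np

# Hypothetical sizes: D input variables, K output units.
D, K = 3, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(K, D))   # weights w_ji: row j -> output unit, column i -> input
x_n = rng.normal(size=D)      # one input pattern x_n
t_n = rng.normal(size=K)      # its target t_n

y_n = W @ x_n                 # linear model: y_nk = sum_i w_ki * x_ni
delta = y_n - t_n             # 'error signal' at the output end of each link

# Gradient of E_n = 0.5 * sum_k (y_nk - t_nk)^2 with respect to each weight w_ji:
# dE_n/dw_ji = (y_nj - t_nj) * x_ni, i.e. the outer product of delta and x_n.
grad_W = np.outer(delta, x_n)

# Sanity check with a finite-difference estimate for one weight (j=0, i=1).
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
E = 0.5 * np.sum((W @ x_n - t_n) ** 2)
E_pert = 0.5 * np.sum((W_pert @ x_n - t_n) ** 2)
print(grad_W[0, 1], (E_pert - E) / eps)   # the two numbers should agree closely
```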

  • By "we do not apply a non-linearity at the final layer's output", do you mean it is a common practice, or do you have a source for that statement? I'm new to neural networks. Commented Aug 1, 2018 at 4:38
  • The final layer produces the prediction(s). In regression you'd expect a number, so it would be unnecessary to put that through a final non-linear activation function. In classification, the final layer's output logits don't go through a non-linearity to increase the power of the network; they just go through a softmax so the outputs can be interpreted as probabilities (softmax squashes values into [0, 1]). Here is a reference. Imagine using a ReLU on the final layer... it would make it impossible for the network to predict negative values. Commented Aug 1, 2018 at 11:45
