
The general form of backpropagation, as I have been told, works for any directed acyclic graph. For a proof, I am pointed to the same derivation that is given for the case of neat hidden layers, where each hidden layer depends only on neurons from the immediately preceding layer (for example, the derivation in Tom Mitchell's Machine Learning or on Wikipedia). I can see how the derivation works in a neural net when the hidden layers are neatly organized, using the multivariate chain rule:

[Image: Basic Neural Network]

However, in a general DAG, there may not be distinct hidden layers. Consider a structure like this one:

[Image: DAG Neural Network]

The output neuron depends on neurons $j$ and $k$. Neuron $j$ depends on neurons $x$ and $y$. Neuron $k$ depends on neuron $y$. Neuron $y$ depends on neuron $x$. Neuron $x$ is fed the input (not shown).
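For concreteness, the dependency structure can be written down as a parent map in Python; the label "o" for the output neuron is just my own shorthand:

```python
# Parent sets for the DAG described above; "o" stands for the output neuron,
# and x is the neuron that receives the external input (not shown).
parents = {
    "x": [],          # input neuron
    "y": ["x"],       # y depends on x
    "j": ["x", "y"],  # j depends on both x and y
    "k": ["y"],       # k depends on y
    "o": ["j", "k"],  # the output neuron depends on j and k
}
```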

Following the Wikipedia proof, the chain rule is applied to $\mathrm{net}_j$ as if it did not contain $\mathrm{net}_y$, and a separate term for $\mathrm{net}_y$ is added. Also, according to the Wikipedia proof, $\frac{\partial \mathrm{net}_j}{\partial o_x}$ evaluates simply to $w_{xj}$. First, is this application of the chain rule, which ignores the presence of $\mathrm{net}_y$ inside $\mathrm{net}_j$, valid? Most calculus texts require the chained inner functions to share the same arguments, with none of them depending on another inner function. Second, the evaluation of $\frac{\partial \mathrm{net}_j}{\partial o_x}$ then seems plain wrong. It looks as if the proofs (Mitchell's and Wikipedia's) are invalid, and yet they are accepted. Why are the proofs considered correct? (Or what would make them correct?)

  • Not sure why you are considering $\frac{\partial\mathrm{net}_j}{\partial o_x}$, given that $\mathrm{net}_j$ is not a function of $o_x$. Commented Aug 4 at 17:05
  • Also note that, in gradient descent, you differentiate the loss function with respect to the parameters, not with respect to the inputs of the network. (Inputs enter the picture through the dataset, which is essentially a constant.) That is why, for example, you can regard $\mathrm{net}_y$ as a constant (hence not participating in the differentiation) when you differentiate the loss w.r.t. the weights of neuron $j$. Commented Aug 4 at 17:11
  • The proofs in Mitchell and Wikipedia use that notation. I think it refers to the fact that $\mathrm{net}_j$ contains a term $w_{xj}o_x$, so $\mathrm{net}_j$ can be viewed as a function of $o_x$. $o_x$ is not an input. Commented Aug 4 at 17:12
  • $o_x$ means the output of $x$, though. That is different from a neuron $o$, whose output would be $o_o$. Commented Aug 4 at 17:31
  • Thanks for the clarification; I was hesitant and didn't pay close attention to how the notation is set up in the article. Commented Aug 5 at 4:09

1 Answer


I believe the confusion arises from the interpretation of $\mathrm{net}_j$.

Part 1. To clarify, we first review a general result. Let $\mathsf{G} = (\mathsf{V}, \mathsf{E})$ be a finite directed acyclic graph (DAG). For each node $v \in \mathsf{V}$, we write $\mathsf{pa}(v)$ for the set of parents of $v$ (the nodes with an edge into $v$) and $\mathsf{ch}(v)$ for the set of children of $v$ (the nodes reached by an edge out of $v$). Assume that each node $v$ carries a variable that is a differentiable function of its parents, i.e., $v = f_v(\mathsf{pa}(v))$ for some differentiable function $f_v : \mathbb{R}^{\mathsf{pa}(v)} \to \mathbb{R}$.

Let $x$ be a node in $\mathsf{G}$, and let $\mathsf{de}_0(x)$ denote the set of all descendants of $x$, including $x$ itself. If $v \in \mathsf{de}_0(x)$, we express $v$ by recursively applying the functional relation $w = f_w(\mathsf{pa}(w))$ until either $x \in \mathsf{pa}(w)$ or $\mathsf{pa}(w) \cap \mathsf{de}_0(x) = \varnothing$. This allows us to write $v$ as a function of $x$ and the "non-descendants" of $x$. We denote this functional relation as $F_{vx}$: $$ v = F_{vx}(x, \mathsf{V}\setminus\mathsf{de}_0(x)). $$

Example. In the OP's DAG, we have $$j = f_j(x, y) = f_j(x, f_y(x)),$$ and hence $F_{jx}(x) = f_j(x, f_y(x))$.
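To make the recursive substitution concrete, here is a minimal Python sketch; the particular node functions (a $\tanh$ and a weighted sum) are made-up choices for illustration only:

```python
import math

# Toy differentiable node functions for the OP's DAG; the specific choices
# below are illustrative assumptions, not part of the original problem.
def f_y(x):          # y = f_y(x)
    return math.tanh(x)

def f_j(x, y):       # j = f_j(x, y)
    return 2.0 * x + 3.0 * y

# F_{jx}: j expressed as a composite function of x alone, obtained by
# recursively substituting y = f_y(x) into f_j.
def F_jx(x):
    return f_j(x, f_y(x))

print(F_jx(0.5))     # value of j once x is fixed
```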

To avoid ambiguity, we denote the variable $v$ in this context by $[v]_x$, or simply $[v]$ if the variable $x$ is clear from the context. (Note that this is not standard notation.) Then, we have:

Theorem. Under the above setting,

$$ \frac{\partial [v]}{\partial x} = \sum_{y \in \mathsf{ch}(x)} \frac{\partial [v]}{\partial y}\frac{\partial y}{\partial x}. $$ More formally,

$$ \frac{\partial F_{vx}}{\partial x} = \sum_{y \in \mathsf{ch}(x)} \frac{\partial F_{vy}}{\partial y}\frac{\partial f_y}{\partial x}. $$

This result is not immediately obvious, so we provide a proof at the end for completeness.
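Before turning to the application in Part 2, here is a quick numerical sanity check of the theorem on the OP's DAG. The node functions are the toy ones from the example above, plus made-up choices for $k$ and the output $o$, so all concrete numbers are assumptions for illustration; the point is only that the child-sum recursion agrees with a finite-difference derivative.

```python
import math

# Toy node functions for the OP's DAG (illustrative assumptions only).
def f_y(x):    return math.tanh(x)
def f_j(x, y): return 2.0 * x + 3.0 * y
def f_k(y):    return y * y
def f_o(j, k): return math.sin(j) + k

def forward(x):
    """Evaluate every node as a composite function of the single input x."""
    y = f_y(x)
    j = f_j(x, y)
    k = f_k(y)
    return {"x": x, "y": y, "j": j, "k": k, "o": f_o(j, k)}

def d_o_dx_by_theorem(x):
    """d[o]/dx via the theorem: sum over the children of x (here j and y)."""
    v = forward(x)
    # Local partial derivatives of each node function w.r.t. its direct parents.
    dfy_dx = 1.0 - math.tanh(x) ** 2
    dfj_dx, dfj_dy = 2.0, 3.0
    dfk_dy = 2.0 * v["y"]
    dfo_dj, dfo_dk = math.cos(v["j"]), 1.0
    # Total derivatives d[o]/d(node), accumulated from the output backwards.
    dO_dj = dfo_dj
    dO_dk = dfo_dk
    dO_dy = dO_dj * dfj_dy + dO_dk * dfk_dy
    # Theorem applied at node x, whose children are j and y.
    return dO_dj * dfj_dx + dO_dy * dfy_dx

def d_o_dx_numeric(x, h=1e-6):
    """Central finite-difference approximation of d[o]/dx."""
    return (forward(x + h)["o"] - forward(x - h)["o"]) / (2.0 * h)

x0 = 0.3
print(d_o_dx_by_theorem(x0), d_o_dx_numeric(x0))   # the two values should agree closely
```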


Part 2. Now, we return to the original problem. The neural network represented by the DAG takes the form:

[Image: neural network]

Here, we assume $x$ is the input node, so $o_x$ represents the input, and each remaining node is a non-learnable function (i.e., a function without learnable parameters) of its parents; in particular, the weights appear as parent nodes in this graph. This graph can be used to compute the derivative of $E$ with respect to each weight.

Using the notation from Part 1, we have: $$\begin{align*} \frac{\partial [\mathrm{net}_j]}{\partial o_x} &= \frac{\partial [\mathrm{net}_j]}{\partial \mathrm{net}_j}\frac{\partial \mathrm{net}_j}{\partial o_x} + \frac{\partial [\mathrm{net}_j]}{\partial \mathrm{net}_y}\frac{\partial \mathrm{net}_y}{\partial o_x} \\ &= w_{xj} + \frac{\partial [\mathrm{net}_j]}{\partial \mathrm{net}_y}w_{xy}. \end{align*}$$

This explains the discrepancy between the proof in the Wikipedia article and the OP's argument. In the article, the theorem in Part 1 is used to expand the partial derivatives of $E$ with respect to neuron outputs. As such, $\frac{\partial \mathrm{net}_j}{\partial o_x}$ refers to the partial derivative of $\mathrm{net}_j$ as a function of its direct parents, namely $o_y$, $w_{yj}$, $o_x$, and $w_{xj}$. In contrast, OP considers the derivative $\frac{\partial [\mathrm{net}_j]}{\partial o_x}$, where $\mathrm{net}_j$ is expanded as a composite function of $o_x$ (and other non-descendants). These two derivatives must be distinguished.
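To see the distinction numerically, here is a small sketch. The weights and the logistic activation are made-up values (the argument does not depend on them); the local derivative equals $w_{xj}$, while the composite derivative matches a finite-difference check of $[\mathrm{net}_j]$.

```python
import math

def sigma(t):                       # logistic activation (an illustrative choice)
    return 1.0 / (1.0 + math.exp(-t))

# Made-up weights for the edges x->j, y->j and x->y of the OP's DAG.
w_xj, w_yj, w_xy = 0.7, -1.2, 0.4

def net_j_composite(o_x):
    """[net_j]: net_j fully expanded as a function of o_x alone."""
    net_y = w_xy * o_x
    o_y = sigma(net_y)
    return w_xj * o_x + w_yj * o_y

o_x = 0.9
net_y = w_xy * o_x

# Local partial derivative of net_j w.r.t. its direct parent o_x (Wikipedia's usage).
local = w_xj

# Composite derivative, following the displayed formula:
# d[net_j]/d net_y = w_yj * sigma'(net_y) and d net_y / d o_x = w_xy.
composite = w_xj + w_yj * sigma(net_y) * (1.0 - sigma(net_y)) * w_xy

# Finite-difference check of the composite derivative.
h = 1e-6
numeric = (net_j_composite(o_x + h) - net_j_composite(o_x - h)) / (2.0 * h)

print(local, composite, numeric)    # local != composite, composite ~= numeric
```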


Proof of Theorem. Let $v$ be a descendant of $x$ with $v \neq x$. For convenience, set $F_{xx} = x$ (so that $\frac{\partial F_{xx}}{\partial x} = 1$) and $F_{wx} = w$ for every $w \notin \mathsf{de}_0(x)$ (so that $\frac{\partial F_{wx}}{\partial x} = 0$).

  • If $\mathsf{pa}(v)$ contains a proper descendant of $x$ (it may or may not also contain $x$ itself, as happens for node $j$ in the OP's DAG), then $F_{vx}$ satisfies the relation:

    $$ F_{vx} = f_v(F_{wx} : w \in \mathsf{pa}(v)). $$

    So by the chain rule, we obtain

    $$ \begin{align*} \frac{\partial F_{vx}}{\partial x} &= \sum_{v_1 \in \mathsf{pa}(v)} \frac{\partial f_v}{\partial v_1} \frac{\partial F_{v_1x}}{\partial x}. \end{align*}$$

  • Otherwise, $\mathsf{pa}(v)$ is a subset of $\{x\} \cup (\mathsf{V}\setminus\mathsf{de}_0(x)) $ and $F_{vx} = f_v$. Hence,

    $$ \begin{align*} \frac{\partial F_{vx}}{\partial x} &= \begin{cases} \frac{\partial f_v}{\partial x}, & \text{if $x \in \mathsf{pa}(v)$}, \\ 0, & \text{if $x \notin \mathsf{pa}(v)$}. \end{cases} \end{align*}$$

Repeatedly applying these relations, we end up with

$$ \begin{align*} \frac{\partial F_{vx}}{\partial x} &= \sum_{x = v_n \to v_{n-1} \to \cdots \to v_1 \to v_0 = v} \frac{\partial f_{v_0}}{\partial v_1} \frac{\partial f_{v_1}}{\partial v_2} \cdots \frac{\partial f_{v_{n-1}}}{\partial v_n}, \end{align*}$$

where the sum runs over all directed paths from $x$ to $v$, and each factor is the partial derivative of a local function $f$ with respect to one of its direct parents. The desired conclusion follows by grouping the terms according to $v_{n-1}$, the child of $x$ through which the path leaves $x$.
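The path-sum form also lends itself to a direct check in code. Below is a small sketch that enumerates every directed path from $x$ to the output and sums the products of the local derivatives along each path, using the same toy node functions as in the earlier check (again, all concrete functions and numbers are assumptions for illustration):

```python
import math

# Children of each node in the OP's DAG ("o" is the output neuron).
children = {"x": ["y", "j"], "y": ["j", "k"], "j": ["o"], "k": ["o"], "o": []}

def all_paths(src, dst):
    """All directed paths from src to dst in the DAG."""
    if src == dst:
        return [[src]]
    return [[src] + rest for c in children[src] for rest in all_paths(c, dst)]

# Local partial derivatives d f_child / d parent for the toy node functions
# f_y = tanh(x), f_j = 2x + 3y, f_k = y^2, f_o = sin(j) + k, at input x0 = 0.3.
x0 = 0.3
y0 = math.tanh(x0)
j0 = 2.0 * x0 + 3.0 * y0

local = {
    ("x", "y"): 1.0 - y0 ** 2,   # d f_y / d x
    ("x", "j"): 2.0,             # d f_j / d x
    ("y", "j"): 3.0,             # d f_j / d y
    ("y", "k"): 2.0 * y0,        # d f_k / d y
    ("j", "o"): math.cos(j0),    # d f_o / d j
    ("k", "o"): 1.0,             # d f_o / d k
}

# Path-sum formula: sum over all paths of the product of edge-local derivatives.
total = sum(
    math.prod(local[(p, c)] for p, c in zip(path, path[1:]))
    for path in all_paths("x", "o")
)
print(total)   # matches the recursive and finite-difference values from the earlier check
```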
