I've been working through RL theory and practice, and one part I find hard to understand properly is the relation between the practical loss function and the theoretical objective (or the gradient of that objective). How do we derive one from the other? Maybe it is easier with some examples:
REINFORCE with baseline:

Theoretical critic objective gradient (the stochastic update it induces):

    w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
Practical loss (pseudocode) I've come across:
    critic_loss = 0.5 * (returns - v_hat(states))**2
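For concreteness, here is a minimal sketch of how I understand that critic loss being used, assuming a PyTorch-style setup; the network architecture and the names v_hat, states and returns are placeholders I made up, not from any particular codebase:

    import torch
    import torch.nn as nn

    # Toy value network standing in for v_hat(s; w).
    v_hat = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.SGD(v_hat.parameters(), lr=1e-3)

    states = torch.randn(16, 4)   # states visited in an episode (made-up shapes)
    returns = torch.randn(16)     # Monte Carlo returns G_t for those states

    values = v_hat(states).squeeze(-1)
    critic_loss = 0.5 * ((returns - values) ** 2).mean()

    optimizer.zero_grad()
    critic_loss.backward()  # per sample, autograd yields -(G_t - v_hat) * grad of v_hat
    optimizer.step()        # gradient descent step on the loss

My (possibly wrong) impression is that the descent step on this loss ends up matching the theoretical update above, but I can't connect the two cleanly.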
Semi-gradient SARSA:

Theoretical objective gradient (the semi-gradient update it induces):

    w ← w + α [R + γ q̂(S', A', w) − q̂(S, A, w)] ∇q̂(S, A, w)
Practical loss (pseudocode) I've come across:
    loss = (r + gamma * qvalue(s_next, a_next) - qvalue(s, a)) ** 2
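Again for concreteness, a minimal sketch of how I understand that loss, assuming a PyTorch-style setup with a small Q-network; all names and shapes (qvalue, s, a, r, s_next, a_next) are placeholders of mine. The torch.no_grad() around the target is how I've seen the "semi" part expressed, i.e. no gradient flows through q̂(S', A', w):

    import torch
    import torch.nn as nn

    # Toy Q-network standing in for q_hat(s, a; w): 4-dim state, 2 actions.
    qvalue = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(qvalue.parameters(), lr=1e-3)
    gamma = 0.99

    # One made-up transition (S, A, R, S', A').
    s, a = torch.randn(1, 4), torch.tensor([0])
    r = torch.tensor([1.0])
    s_next, a_next = torch.randn(1, 4), torch.tensor([1])

    q_sa = qvalue(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # treat the bootstrap target as a constant
        target = r + gamma * qvalue(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)

    loss = ((target - q_sa) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()   # yields -2 * (target - q_sa) * grad of q_sa
    optimizer.step()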
Where are these losses coming from? Are they constructed so that their gradient is the negative of ∇J(w), i.e. something like an antiderivative of −∇J(w)? If so, could somebody show how that is done? I can only find resources that do this for a w with fixed dimensions.

I might be completely off here and missing some foundational property. If that is the case, I would be deeply grateful if you could point me to relevant (introductory) literature.
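To show what I've tried: for the Monte Carlo critic loss I can differentiate directly,

    ∇ [0.5 (G_t − v̂(S_t, w))²] = −(G_t − v̂(S_t, w)) ∇v̂(S_t, w),

which is the negative of the update direction above, so gradient descent on the loss seems to recover the theoretical update. But in the SARSA case the target also depends on w, so naively differentiating the squared TD error does not give the semi-gradient update, and I don't see how to justify that loss except by treating the target as a constant.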

