Reinforcement Learning Algorithms: Why does TRPO use the state value (V) rather than the action value in its computations?
Why Don’t You Just Use a Q-Function?
Previous actor critic methods, e.g. in [KT03], use a Q-function to obtain potentially
low-variance policy gradient estimates. Recent papers, including [Hee+15; Lil+15], have
shown that a neural network Q-function approximator can be used effectively in a policy
gradient method. However, there are several advantages to using a state-value function
in the manner of this paper. First, the state-value function has a lower-dimensional input
and is thus easier to learn than a state-action value function. Second, the method of this
paper allows us to smoothly interpolate between the high-bias estimator ($\lambda = 0$) and
the low-bias estimator ($\lambda = 1$). On the other hand, using a parameterized Q-function
only allows us to use a high-bias estimator. We have found that the bias is prohibitively
large when using a one-step estimate of the returns, i.e., the $\lambda = 0$ estimator, $\hat{A}_t = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. We expect that similar difficulty would be encountered when using an advantage estimator involving a parameterized Q-function, $\hat{A}_t = Q(s, a) - V(s)$.
There is an interesting space of possible algorithms that would use a parameterized
Q-function and attempt to reduce bias; however, an exploration of these possibilities is
beyond the scope of this work.
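
To make the $\lambda$-interpolation concrete, here is a minimal NumPy sketch of the generalized advantage estimator the passage refers to. The function name `gae_advantages`, the trajectory layout, and the `gamma=0.99` / `lam=0.95` defaults are illustrative assumptions, not taken from the paper; only the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ comes from the text above.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite trajectory.

    rewards: array of length T with r_0 .. r_{T-1}
    values:  array of length T+1 with V(s_0) .. V(s_T), where
             values[T] is a bootstrap value (0 for a terminal state)

    lam=0 recovers the one-step TD-error estimator (high bias, low variance);
    lam=1 recovers the Monte Carlo advantage estimate (low bias, high variance).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Accumulate backwards: A_t = delta_t + gamma * lam * A_{t+1},
    # where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example on a random trajectory: with lam=0.0 the output equals the
# one-step TD errors, the high-bias estimator criticized above.
rng = np.random.default_rng(0)
rewards = rng.normal(size=10)
values = rng.normal(size=11)
print(gae_advantages(rewards, values, lam=0.0))
```

Note that a parameterized Q-function offers no analogue of the `lam` knob here: $\hat{A}_t = Q(s, a) - V(s)$ is a single, fixed estimator, which is the bias limitation the quoted passage is pointing at.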


