Question

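I am trying to implement Vanilla Policy Gradient, which is basically the REINFORCE algorithm with an advantage function. To estimate the advantage function, the value function V(s) has to be computed. REINFORCE using only the return works, but after I tried to replace the return with the advantage function, I got this error: `ValueError: No gradients provided for any variable`. Thank you for your help. I can send the complete code if that would be useful.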

    # make action selection op (outputs int actions, sampled from policy)
    actions = tf.squeeze(tf.multinomial(logits=logits,num_samples=1), axis=1)

    #computing value function
    value_app = tf.squeeze(funct.critic_nn(obs_ph), axis=1)

    # make loss function whose gradient, for the right data, is policy gradient
    weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
    adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
    v_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
    act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)

    #Loss for actor
    action_masks = tf.one_hot(act_ph, n_acts)
    log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(logits), axis=1)
    loss = -tf.reduce_mean(adv_ph * log_probs)

    #Loss for critic
    critic_loss = tf.reduce_mean((v_ph - weights_ph)**2)

    #optimizers
    train_actor = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)
    train_critic = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(critic_loss)

Answer 1

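I figured it out. The problem was that I built the critic loss entirely from placeholders and fed `v_ph` with values recorded from the value function network. A loss built only from placeholders has no path to any trainable variable, so the optimizer has nothing to differentiate. Instead of the placeholder (`v_ph`), the loss has to use the actual output of the neural network. That means you should record the states from the environment and, during the training phase, feed them through the value function approximator, then build the loss that gets minimized from its output: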

    critic_loss = tf.reduce_mean((value_app - weights_ph)**2)
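To make the cause of the error more concrete, here is a toy sketch in plain Python. It is not TensorFlow. The `Node` class, `trainable_ancestors`, `minimize`, and the name `critic_w` are all hypothetical stand-ins invented for illustration. It only mimics the check an optimizer has to perform: walk back from the loss and see whether any trainable variable is reachable.

```python
class Node:
    """A node in a toy computation graph (hypothetical stand-in for a tensor)."""

    def __init__(self, name, inputs=(), trainable=False):
        self.name = name
        self.inputs = tuple(inputs)
        self.trainable = trainable


def trainable_ancestors(node):
    """Return the names of all trainable variables reachable from `node`."""
    found, seen, stack = set(), set(), [node]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        if current.trainable:
            found.add(current.name)
        stack.extend(current.inputs)
    return found


def minimize(loss):
    """Mimic the optimizer's check: reject a loss that touches no variable."""
    variables = trainable_ancestors(loss)
    if not variables:
        raise ValueError("No gradients provided for any variable")
    return sorted(variables)


# Placeholders are graph inputs with no trainable parameters behind them.
obs_ph = Node("obs_ph")
weights_ph = Node("weights_ph")
v_ph = Node("v_ph")

# The critic output depends on the observations and on a trainable weight.
critic_w = Node("critic_w", trainable=True)  # assumed name for the critic's parameters
value_app = Node("value_app", inputs=(obs_ph, critic_w))

# Loss from the question: placeholders only, so no variable is reachable.
bad_loss = Node("bad_critic_loss", inputs=(v_ph, weights_ph))
# Loss from this answer: built from the network output.
good_loss = Node("good_critic_loss", inputs=(value_app, weights_ph))

print("good loss trains:", minimize(good_loss))
try:
    minimize(bad_loss)
except ValueError as err:
    print("bad loss fails:", err)
```

In the real graph the same reasoning applies: `value_app` is connected to the critic's weights through `funct.critic_nn(obs_ph)`, whereas `v_ph` is just an input slot, so only the first can be trained against `weights_ph`.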
