Question

I am struggling to apply TensorFlow eager execution (TF2) to train the actor of the actor-critic DDPG algorithm. The description in this example says:

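Use the actor's online network to get the action mean values, taking the current states as input. Then use the critic's online network to get the gradient of the critic output with respect to the action mean values, ∇aQ(s, a) | s = s_t, a = μ(s_t). Given ∇aQ(s, a), use the chain rule to calculate the gradient of the actor output with respect to the actor weights. Finally, apply those gradients to the actor network.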
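To make the chain rule in that description concrete, here is a minimal sketch in plain Python. The scalar actor μ(s) = w·s + b and the quadratic critic are toy functions invented for illustration only (they are not from the example or from my code); they are chosen so the derivatives can be written by hand and checked against finite differences.

```python
# Toy scalar actor and critic (hypothetical, chosen only so that the
# derivatives are easy to write down and verify).
def actor(w, b, s):
    return w * s + b                # mu(s)

def critic(s, a):
    return -(a - 2.0 * s) ** 2      # Q(s, a), largest when a = 2s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)     # gradient of Q with respect to the action

def actor_gradient(w, b, s):
    """Chain rule: dQ/dtheta = dQ/da (at a = mu(s)) * dmu/dtheta."""
    a = actor(w, b, s)
    g = dq_da(s, a)
    return g * s, g * 1.0           # dmu/dw = s, dmu/db = 1

def numeric_gradient(w, b, s, eps=1e-6):
    """Central finite differences of Q(s, mu(s)) with respect to w and b."""
    q = lambda w_, b_: critic(s, actor(w_, b_, s))
    dw = (q(w + eps, b) - q(w - eps, b)) / (2 * eps)
    db = (q(w, b + eps) - q(w, b - eps)) / (2 * eps)
    return dw, db

w, b, s = 0.5, 0.1, 1.5
analytic = actor_gradient(w, b, s)
numeric = numeric_gradient(w, b, s)
print(analytic, numeric)
```

Moving w and b along these gradients (gradient ascent on Q) pushes the action toward the value the critic rates highest, which is what the actor update is meant to do.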

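Both the actor and the critic are Keras models. Sorry for not posting the complete code; I hope my problem can be understood from these relevant snippets.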

def fit_actor(self, state):

    action = self.predict_actor(state)              #online actor
    q_value = self.predict_critic(state, action)    #online critic

    param_gradient = self.tape_online_critic.gradient(q_value, [action])
    #=> is always [None], also for tape_online_actor

    gradient = zip(param_gradient, self.online_actor.trainable_weights)

    self.optimizer_actor.apply_gradients(gradient)

In "predict_critic" the critic is fed the state and the action, so "tape_online_critic" has the operations recorded.

#for online critic and online actor
def predict_critic(self, state, actions):

    with self.tape_online_critic as tape:
        return self.online_critic([state, actions])


def predict_actor(self, state):

    with self.tape_online_actor as tape:
        return self.online_actor([state])

I have tried almost every possible combination of variables, tapes, etc., but I always end up with [None] gradients and a ValueError.

ValueError: No gradients provided for any variable: ['conv2d_2/kernel:0']

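I am unsure about the order of the arguments of the tape.gradient() function, and even about the nature of the arguments themselves.
As I understand it, I have to call tape.gradient() on the online actor's tape to get the gradient of the actor weights with respect to the operations recorded in the predict functions inside the "with ... tape:" context. Do I also need something like tape.watch(someTensor) there? The documentation says that variables are watched automatically, so I assume not.
Is tf.GradientTape even the right thing to use here? There is also tfe.gradients_function.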
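For reference, here is a minimal sketch of how the actor update is commonly written with a single tf.GradientTape. It is not taken from my project: the layer sizes, state_dim, action_dim and the stand-in networks are assumptions made up for illustration. The signature is tape.gradient(target, sources), so the target is the quantity to differentiate (here the negated mean Q-value) and the sources are the actor's trainable variables. The key assumption is that both forward passes run inside the same tape context; in the snippets above they are recorded on two separate tapes, so neither tape sees the full path from the actor weights to q_value, which is one likely reason for the [None] results.

```python
import tensorflow as tf

# Hypothetical stand-in networks; the real models are not shown in the question.
state_dim, action_dim = 4, 2

actor_in = tf.keras.Input(shape=(state_dim,))
actor_hidden = tf.keras.layers.Dense(16, activation="relu")(actor_in)
actor_out = tf.keras.layers.Dense(action_dim, activation="tanh")(actor_hidden)
online_actor = tf.keras.Model(actor_in, actor_out)

state_in = tf.keras.Input(shape=(state_dim,))
action_in = tf.keras.Input(shape=(action_dim,))
merged = tf.keras.layers.Concatenate()([state_in, action_in])
critic_hidden = tf.keras.layers.Dense(16, activation="relu")(merged)
critic_out = tf.keras.layers.Dense(1)(critic_hidden)
online_critic = tf.keras.Model([state_in, action_in], critic_out)

optimizer_actor = tf.keras.optimizers.Adam(1e-3)

def fit_actor(state):
    # One tape records the whole path: actor weights -> action -> Q-value.
    with tf.GradientTape() as tape:
        action = online_actor(state, training=True)
        q_value = online_critic([state, action], training=True)
        # Minimising -Q is gradient ascent on Q with respect to the actor weights.
        loss = -tf.reduce_mean(q_value)
    # Only the actor's variables are used as sources, so the critic is not updated.
    grads = tape.gradient(loss, online_actor.trainable_variables)
    # apply_gradients expects (gradient, variable) pairs, in that order.
    optimizer_actor.apply_gradients(zip(grads, online_actor.trainable_variables))
    return grads

states = tf.random.normal((8, state_dim))
grads = fit_actor(states)
print([g.shape for g in grads])
```

In this form no tape.watch call is needed, because trainable variables are watched automatically. It would be needed only to differentiate with respect to a plain tensor such as action, which is not watched by default.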

TensorFlow with eager execution: DDPG - applying action gradients to the actor

0 answers: