Question

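I have been trying to use TensorFlow to implement the training step for the DQN described in this paper on various RL methods, but when I try to compute the gradients with GradientTape I get ValueError: No gradients provided for any variable. Here is the training step code: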

def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))

        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]

        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)

        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))

        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

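Here model and target are tf.keras.Sequential models that output the expected value of taking each of 5 possible actions, the optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from the experience replay buffer.

This is part of an implementation of the following pseudocode from the paper mentioned above:

My best guess is that the error occurs because indexing into logits makes the computation non-differentiable, but I don't know how else to compute the Q*(s,a,theta) quantity.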

Answer 1

Adding the solution to the answer section for the benefit of the community.

From Comments:

The problem was resolved by replacing the code:

act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
   
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]

with the code:

act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
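Some background on why this helps: the original loop copies the selected values into a NumPy array, and GradientTape only records TensorFlow operations, so the loss ends up disconnected from the model's variables. Note that the replacement line uses act_logits on its right-hand side, so it presumably assumes act_logits already holds a one-hot mask of the chosen actions; that is an assumption, since the comment does not show how the mask is built. The following is a minimal sketch under that assumption, not the original poster's exact code. The tiny Dense model, BATCH_SIZE, the zero targets, and the helper selected_q_values are hypothetical names introduced for illustration. It uses reduce_sum rather than reduce_max, because with a one-hot mask reduce_max returns 0 whenever the selected Q-value is negative.

```python
import numpy as np
import tensorflow as tf

NUM_ACTIONS = 5  # assumption: the 5 possible actions mentioned in the question
BATCH_SIZE = 4   # hypothetical stand-in for EXPERIENCE_SAMPLE_SIZE

# Hypothetical tiny Q-network, only here to make the sketch runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(NUM_ACTIONS)])

observations = np.random.rand(BATCH_SIZE, 3).astype(np.float32)
actions = np.array([0, 2, 4, 1], dtype=np.int32)
targets = tf.zeros(BATCH_SIZE)  # placeholder for the bootstrapped target y_T


def selected_q_values(q_values, actions):
    """Pick Q(s, a) per row using TensorFlow ops only, so the tape can trace it."""
    mask = tf.one_hot(actions, NUM_ACTIONS, dtype=q_values.dtype)
    return tf.reduce_sum(mask * q_values, axis=1)


# Broken variant: copying values into a NumPy array detaches them from the tape.
with tf.GradientTape() as tape:
    q_values = model(observations)
    picked = np.zeros(BATCH_SIZE, dtype=np.float32)
    for i in range(BATCH_SIZE):
        picked[i] = q_values[i][int(actions[i])]
    picked = tf.convert_to_tensor(picked)
    loss = tf.reduce_mean(tf.math.squared_difference(picked, targets))
broken_grads = tape.gradient(loss, model.trainable_variables)

# Working variant: the selection stays inside TensorFlow.
with tf.GradientTape() as tape:
    q_values = model(observations)
    picked = selected_q_values(q_values, actions)
    loss = tf.reduce_mean(tf.math.squared_difference(picked, targets))
working_grads = tape.gradient(loss, model.trainable_variables)

print([g is None for g in broken_grads])   # every entry is None
print([g is None for g in working_grads])  # no entry is None
```

Passing broken_grads to optimizer.apply_gradients is what raises the ValueError from the question, since every gradient is None. An alternative to the one-hot mask is tf.gather(q_values, actions, batch_dims=1), which should select the same values without building the mask.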
