I'm trying to implement the Deep Deterministic Policy Gradient (DDPG) algorithm in TensorFlow, but the policy never converges to anything remotely good. I'm testing it on the cart-pole problem.
Over time the critic's loss drops to 0 and the actor's gradients also converge to 0, yet the reward never improves. It looks like the actor settles on a "constant" policy: it pushes the cart in one direction until the episode fails. I add exploration noise with an Ornstein-Uhlenbeck process whose sigma decays over time, and the initial angle is randomized, so the agent should explore the state space reasonably well.
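For reference, by an Ornstein-Uhlenbeck process I mean the standard discrete-time recursion sketched below; the theta/sigma values here are placeholder defaults rather than my actual settings, and sigma is decayed externally between episodes as described above:

import numpy as np

class OrnsteinUhlenbeckNoise:
    # x_{t+1} = x_t + theta * (mu - x_t) + sigma * N(0, 1)
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu
        self.theta = theta
        self.sigma = sigma  # decayed externally over time
        self.state = np.full(action_dim, mu)

    def reset(self):
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state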
Performing gradient ascent versus gradient descent on the actor doesn't seem to make any difference. Something must be badly wrong somewhere, but I don't know how to track it down.
Here is the top-level actor update:
predicted_actions = actor.get_actions(session, replay_states)
critic_action_gradients = critic.get_action_gradients(
    session, replay_states, predicted_actions
)
gradient_magnitude = actor.update_weights(
    session, replay_states, critic_action_gradients
)
actor.update_target_network(session)
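For reference, the intent of this update is to approximate the deterministic policy gradient from the DDPG paper (Lillicrap et al.):

\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s_i, a \mid \theta^Q)\big|_{a = \mu(s_i \mid \theta^\mu)} \, \nabla_{\theta^\mu} \mu(s_i \mid \theta^\mu)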
Here is the critic's get_action_gradients:
def get_action_gradients(self, session, states, actions):
    # one session.run per (state, action) pair, evaluating
    # d(critic output)/d(action input) for a single sample at a time
    return [
        session.run(self.action_gradients, feed_dict={
            self.nnet_input_state: np.array([state], dtype=np.float32),
            self.nnet_input_action: np.array([action], dtype=np.float32),
        }) for state, action in zip(states, actions)
    ]
self.action_gradients is just:
self.action_gradients = tf.gradients(self.output, self.nnet_input_action)
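As I understand it, the per-sample loop above should be equivalent to a single batched run, since the critic's output for row i depends only on input row i; a sketch of what I mean (assuming both placeholders are defined with a leading batch dimension; get_action_gradients_batched is just an illustrative name):

def get_action_gradients_batched(self, session, states, actions):
    # hypothetical batched variant: one session.run over the whole minibatch;
    # tf.gradients returns a list, so [0] is the gradient w.r.t. the action
    # input, with shape (len(states), action_dim)
    return session.run(self.action_gradients, feed_dict={
        self.nnet_input_state: np.asarray(states, dtype=np.float32),
        self.nnet_input_action: np.asarray(actions, dtype=np.float32),
    })[0]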
Here is the actor's update_weights:
def update_weights(self, session, replay_states, critic_action_gradients):
    # each row of actor_gradients is multiplied by the corresponding critic gradient,
    # then a column-wise average is taken
    # shape of actor_gradients is len(replay_states) x 6; each column has the
    # shape of the corresponding network weight
    critic_gradients = np.array(critic_action_gradients).reshape((len(replay_states), 1))
    actor_gradients = np.array(self.get_param_gradients(session, replay_states))
    avg_gradients = (actor_gradients * critic_gradients).mean(axis=0)
    new_params = session.run(self.update_weights_ops, feed_dict={
        op: grad for op, grad in zip(self.gradient_placeholders, avg_gradients)
    })
    return sum(np.sum(x) for g in avg_gradients for x in g)
The graph ops for updating the weights are:
self.network_params = [self.weights_1, self.bias_1,
                       self.weights_2, self.bias_2,
                       self.weights_3, self.bias_3]
self.param_gradients = tf.gradients(self.output, self.network_params)
self.gradient_placeholders = []
self.update_weights_ops = []
for param in self.network_params:
    gradient_placeholder = tf.placeholder(shape=param.shape, dtype=tf.float32)
    update_op = param.assign_add(self.learning_rate * gradient_placeholder)
    self.gradient_placeholders.append(gradient_placeholder)
    self.update_weights_ops.append(update_op)
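For comparison, my understanding is that reference DDPG implementations usually keep the chain rule inside the graph rather than multiplying NumPy arrays by hand, roughly along these lines (a sketch only: action_dim, batch_size, and action_gradient_ph are names I'm introducing here, not part of my code):

# hypothetical in-graph actor update, shown only for comparison
self.action_gradient_ph = tf.placeholder(tf.float32, shape=[None, action_dim])
# chain rule dQ/d(theta) = dQ/d(action) * d(action)/d(theta); negated so that
# minimizing with the optimizer performs gradient ascent on Q
raw_gradients = tf.gradients(self.output, self.network_params,
                             grad_ys=-self.action_gradient_ph)
actor_gradients = [g / float(batch_size) for g in raw_gradients]
self.optimize = tf.train.AdamOptimizer(self.learning_rate).apply_gradients(
    list(zip(actor_gradients, self.network_params)))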