I'm trying to implement the Deep Deterministic Policy Gradient (DDPG) algorithm in TensorFlow, but the policy never converges to anything remotely good. I'm testing it on the cart-pole problem.
Over time the critic's loss drops to 0 and the actor's gradients also converge to 0, yet the reward never improves. It looks like the actor settles on a "constant" policy: it pushes the cart in one direction until the episode fails. I add exploration noise with an Ornstein-Uhlenbeck process whose sigma decays over time, and the initial angle is randomized, so the agent should explore the state space reasonably well.
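For reference, by an Ornstein-Uhlenbeck process I mean the standard discrete-time recursion sketched below; the theta/sigma values here are placeholder defaults rather than my actual settings, and sigma is decayed externally between episodes as described above:

import numpy as np

class OrnsteinUhlenbeckNoise:
    # x_{t+1} = x_t + theta * (mu - x_t) + sigma * N(0, 1)
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu
        self.theta = theta
        self.sigma = sigma  # decayed externally over time
        self.state = np.full(action_dim, mu)

    def reset(self):
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state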
Performing gradient ascent versus gradient descent on the actor doesn't seem to make any difference. Something must be badly wrong somewhere, but I don't know how to track it down.
Here is the top-level actor update:
predicted_actions = actor.get_actions(session, replay_states)
critic_action_gradients = critic.get_action_gradients(
    session, replay_states, predicted_actions
)
gradient_magnitude = actor.update_weights(
    session, replay_states, critic_action_gradients
)
actor.update_target_network(session)
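For reference, the intent of this update is to approximate the deterministic policy gradient from the DDPG paper (Lillicrap et al.):

\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s_i, a \mid \theta^Q)\big|_{a = \mu(s_i \mid \theta^\mu)} \, \nabla_{\theta^\mu} \mu(s_i \mid \theta^\mu)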
Here is the critic's get_action_gradients:
def get_action_gradients(self, session, states, actions):
    # one session.run per (state, action) pair, evaluating
    # d(critic output)/d(action input) for a single sample at a time
    return [
        session.run(self.action_gradients, feed_dict={
            self.nnet_input_state: np.array([state], dtype=np.float32),
            self.nnet_input_action: np.array([action], dtype=np.float32),
        }) for state, action in zip(states, actions)
    ]
self.action_gradients is just:
self.action_gradients = tf.gradients(self.output, self.nnet_input_action)
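As I understand it, the per-sample loop above should be equivalent to a single batched run, since the critic's output for row i depends only on input row i; a sketch of what I mean (assuming both placeholders are defined with a leading batch dimension; get_action_gradients_batched is just an illustrative name):

def get_action_gradients_batched(self, session, states, actions):
    # hypothetical batched variant: one session.run over the whole minibatch;
    # tf.gradients returns a list, so [0] is the gradient w.r.t. the action
    # input, with shape (len(states), action_dim)
    return session.run(self.action_gradients, feed_dict={
        self.nnet_input_state: np.asarray(states, dtype=np.float32),
        self.nnet_input_action: np.asarray(actions, dtype=np.float32),
    })[0]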
Here is the actor's update_weights:
def update_weights(self, session, replay_states, critic_action_gradients):
    # each row of actor_gradients is multiplied by the corresponding critic gradient,
    # then a column-wise average is taken
    # shape of actor_gradients is len(replay_states) x 6; each column has the
    # shape of the corresponding network weight
    critic_gradients = np.array(critic_action_gradients).reshape((len(replay_states), 1))
    actor_gradients = np.array(self.get_param_gradients(session, replay_states))
    avg_gradients = (actor_gradients * critic_gradients).mean(axis=0)
    new_params = session.run(self.update_weights_ops, feed_dict={
        op: grad for op, grad in zip(self.gradient_placeholders, avg_gradients)
    })
    return sum(np.sum(x) for g in avg_gradients for x in g)
The graph ops for updating the weights are:
self.network_params = [self.weights_1, self.bias_1,
                       self.weights_2, self.bias_2,
                       self.weights_3, self.bias_3]
self.param_gradients = tf.gradients(self.output, self.network_params)
self.gradient_placeholders = []
self.update_weights_ops = []
for param in self.network_params:
    gradient_placeholder = tf.placeholder(shape=param.shape, dtype=tf.float32)
    update_op = param.assign_add(self.learning_rate * gradient_placeholder)
    self.gradient_placeholders.append(gradient_placeholder)
    self.update_weights_ops.append(update_op)
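For comparison, my understanding is that reference DDPG implementations usually keep the chain rule inside the graph rather than multiplying NumPy arrays by hand, roughly along these lines (a sketch only: action_dim, batch_size, and action_gradient_ph are names I'm introducing here, not part of my code):

# hypothetical in-graph actor update, shown only for comparison
self.action_gradient_ph = tf.placeholder(tf.float32, shape=[None, action_dim])
# chain rule dQ/d(theta) = dQ/d(action) * d(action)/d(theta); negated so that
# minimizing with the optimizer performs gradient ascent on Q
raw_gradients = tf.gradients(self.output, self.network_params,
                             grad_ys=-self.action_gradient_ph)
actor_gradients = [g / float(batch_size) for g in raw_gradients]
self.optimize = tf.train.AdamOptimizer(self.learning_rate).apply_gradients(
    list(zip(actor_gradients, self.network_params)))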