I have checked this question and confirmed that it is not a duplicate.

The problem:

I have implemented an agent that uses a DQN with TensorFlow to learn an optimal policy for the game Dots and Boxes. The algorithm does actually seem to work, judging by its rolling average win rate against a random player, but eventually the Q-values output by the DQN become too large to represent and turn into [inf], at which point an error is raised and the Q-function is no longer usable.

My reward structure is very simple: the agent receives -1 for a loss and 1 for a win. I also clip all gradients to between -1 and 1, specifically to avoid this kind of behavior. Lowering the learning rate seems to delay the Q-value explosion, but given enough time it still happens.
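Concretely, the reward scheme amounts to something like the sketch below (the function name is hypothetical, and the zero reward for non-terminal moves is an assumption used only for illustration):

def reward_for(result):
    # Sketch of the reward scheme: +1 for a win, -1 for a loss.
    # The 0 for any other transition is assumed here for illustration.
    if result == 'win':
        return 1
    if result == 'loss':
        return -1
    return 0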
The relevant code from my implementation is below:
Gradient clipping:
# Clip gradients to prevent gradient explosion
gradients = self.optimizer.compute_gradients(self.loss)
clipped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]
self.update_model = self.optimizer.apply_gradients(clipped_gradients)
(The optimizer is an RMSProp optimizer.)
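For reference, the snippets above and below assume roughly the following graph and replay-table setup. This is only a schematic: build_network, state_shape, num_actions, the dtype layout, and the learning rate are placeholder names and values, not my exact code.

# Schematic context only -- placeholder names/shapes, not the exact graph
self.input_matrix = tf.placeholder(tf.float32, shape=(None,) + state_shape)  # batch of board states
self.target_Q = tf.placeholder(tf.float32, shape=(None, num_actions))        # TD targets fed in td_update
self.Q_values = build_network(self.input_matrix)                             # conv net, one output per action
self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q_values))         # squared TD error
self.optimizer = tf.train.RMSPropOptimizer(learning_rate=0.00025)            # learning rate is illustrative

# Replay table as a structured array; terminal transitions store NaN in 'next_state'
self.replay_table = np.zeros(self.replay_size,
                             dtype=[('state', np.float32, state_shape),
                                    ('action', np.int32),
                                    ('next_state', np.float32, state_shape),
                                    ('reward', np.float32)])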
Update method:
def td_update(self, current_state, last_action, next_state, reward):
    """Updates the Q_function according to the SARSA update algorithm"""
    # Update the replay table
    self.replay_table[self.transition_count % self.replay_size] = (current_state, last_action, next_state, reward)
    self.transition_count = (self.transition_count + 1)

    # Don't start learning until transition table has some data
    if self.transition_count >= self.update_size * 20:
        if self.transition_count == self.update_size * 20:
            print("Replay Table is Ready\n")

        # Get a random subsection of the replay table for mini-batch update
        random_tbl = random.choice(self.replay_table[:min(self.transition_count, self.replay_size)], size=self.update_size)
        feature_vectors = np.vstack(random_tbl['state'])
        actions = random_tbl['action']
        next_feature_vectors = np.vstack(random_tbl['next_state'])
        rewards = random_tbl['reward']

        # Get the indices of the non-terminal states
        non_terminal_ix = np.where([~np.any(np.isnan(next_feature_vectors), axis=(1, 2, 3))])[1]

        q_current = self.get_Q_values(feature_vectors)
        # Default q_next will be all zeros (this encompasses terminal states)
        q_next = np.zeros([self.update_size, len(self._environment.action_list)])
        q_next[non_terminal_ix] = self.get_Q_values(next_feature_vectors[non_terminal_ix])

        # The target should be equal to q_current in every place
        target = q_current.copy()
        # Only actions that have been taken should be updated with the reward.
        # This means that target - q_current will be [0 0 0 0 0 0 x 0 0 ...],
        # so the gradient update will only be applied to the action taken
        # for a given feature vector.
        target[np.arange(len(target)), actions] += (rewards + self.gamma * q_next.max(axis=1))

        # Logging
        if self.log_file is not None:
            print("Current Q Value: {}".format(q_current), file=self.log_file)
            print("Next Q Value: {}".format(q_next), file=self.log_file)
            print("Current Rewards: {}".format(rewards), file=self.log_file)
            print("Actions: {}".format(actions), file=self.log_file)
            print("Targets: {}".format(target), file=self.log_file)

            # Log some of the gradients to check for gradient explosion
            loss, output_grad, conv_grad = self.sess.run(
                [self.loss, self.output_gradient, self.convolutional_gradient],
                feed_dict={self.target_Q: target, self.input_matrix: feature_vectors})
            print("Loss: {}".format(loss), file=self.log_file)
            print("Output Weight Gradient: {}".format(output_grad), file=self.log_file)
            print("Convolutional Gradient: {}".format(conv_grad), file=self.log_file)

        # Update the model
        self.sess.run(self.update_model, feed_dict={self.target_Q: target, self.input_matrix: feature_vectors})
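To make the target construction concrete, here is a tiny standalone illustration with made-up numbers (a batch of 2 samples and 3 actions). It is not part of my code; it just shows what the indexing above does:

# Standalone illustration of the target indexing in td_update (made-up numbers)
import numpy as np

q_current = np.array([[0.1, 0.2, 0.3],
                      [0.4, 0.5, 0.6]])
actions = np.array([2, 0])         # action taken for each sample
rewards = np.array([1.0, -1.0])    # a win and a loss
q_next_max = np.array([0.3, 0.0])  # max_a Q(s', a); zero for a terminal next state
gamma = 0.9

target = q_current.copy()
target[np.arange(len(target)), actions] += rewards + gamma * q_next_max
# Only the taken-action entries are modified; each has (reward + gamma * max_a Q(s', a))
# added on top of its current estimate:
# target == [[ 0.1, 0.2, 1.57],
#            [-0.6, 0.5, 0.6 ]]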
I have checked the targets myself, and I believe they are correct given my understanding of the algorithm and the results of my tests. If anyone can give me some insight into why this is happening, or what I can do to stop it, I would really appreciate it. Please let me know if I need to provide more information.