I'm trying to build a deep Q-network to play Snake. I've run into a problem where the agent doesn't learn, and by the end of training its behaviour is to repeatedly kill itself. After some debugging, I found that the Q-values the network predicts are the same every time. The action space is [up, right, down, left] and the network predicts [0, 0, 1, 0]. The training loss does go down over time, but it doesn't seem to make a difference. Here is the training code:
def train(self):
    tf.logging.set_verbosity(tf.logging.ERROR)
    self.build_model()
    for episode in range(self.max_episodes):
        self.current_episode = episode
        env = SnakeEnv(self.screen)
        episode_reward = 0
        for timestep in range(self.max_steps):
            env.render(self.screen)
            state = self.screenshot()
            #state = env.get_state()
            action = None
            epsilon = self.current_eps
            if epsilon > random.random():
                action = np.random.choice(env.action_space) #explore
            else:
                values = self.policy_model.predict(state) #exploit
                action = np.argmax(values)
            experience = env.step(action)
            if(experience['done'] == True):
                episode_reward += experience['reward']
                break
            episode_reward += experience['reward']
            self.push_memory(Experience(experience['state'], experience['action'], experience['reward'], experience['next_state']))
            self.decay_epsilon(episode)
            if self.can_sample_memory():
                memory_sample = self.sample_memory()
                X = []
                Y = []
                for memory in memory_sample:
                    memstate = memory.state
                    action = memory.action
                    next_state = memory.next_state
                    reward = memory.reward
                    max_q = reward + (self.discount_rate * self.replay_model.predict(next_state)) #bellman equation
                    X.append(memstate)
                    Y.append(max_q)
                X = np.array(X)
                X = X.reshape([-1, 600, 600, 2])
                Y = np.array(Y)
                Y = Y.reshape([self.batch_size, 4])
                self.policy_model.fit(X, Y)
        food_eaten = experience["food_eaten"]
        print("Episode: ", episode, " Total Reward: ", episode_reward, " Food Eaten: ", food_eaten)
        if episode % self.target_update == 0:
            self.replay_model.set_weights(self.policy_model.get_weights())
            self.policy_model.save_weights('weights.hdf5')
    pygame.quit()
Here is the network architecture:
self.policy_model = Sequential()
self.policy_model.add(Conv2D(8, (5, 5), padding = 'same', activation = 'relu', data_format = "channels_last", input_shape = (600, 600, 2)))
self.policy_model.add(Conv2D(16, (5, 5), padding="same", activation="relu"))
self.policy_model.add(Conv2D(32, (5, 5), padding="same", activation="relu"))
self.policy_model.add(Flatten())
self.policy_model.add(Dense(16, activation = "relu"))
self.policy_model.add(Dense(4, activation = "softmax"))
self.policy_model.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
Here are the hyperparameters:
learning_rate = 0.5
discount_rate = 0.99
eps_start = 1
eps_end = .01
eps_decay = .006
memory_size = 100000
batch_size = 8
max_episodes = 1000
max_steps = 5000
target_update = 10
I've let it train for the full 1000 episodes and it's still really bad at the end. Am I doing something wrong in the training algorithm?
Edit: forgot to mention that the agent gets a reward of 0.5 when it moves closer to the food, a reward of 1 when it eats the food, and a reward of -1 when it dies.
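Roughly, the reward logic amounts to this (a simplified sketch with made-up names, the actual checks live in SnakeEnv):

def reward_for(moved_closer_to_food, ate_food, died):
    # Simplified version of the reward scheme described above.
    if died:
        return -1
    if ate_food:
        return 1
    if moved_closer_to_food:
        return 0.5
    return 0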
Answer 0 (score: 1):
Reinforcement learning algorithms need a very low optimizer learning rate (e.g. 1e-4 or lower) so that they don't learn too fast and overfit to a small subspace of the environment (which looks like your problem). Here you appear to be using the optimizer's default learning rate (rmsprop, default 0.001).
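For example, something along these lines (the exact import path and argument name, lr vs learning_rate, depend on your Keras/TensorFlow version):

from keras.optimizers import RMSprop

# Pass an explicit optimizer with a much smaller learning rate instead of
# the string 'rmsprop', which uses the default 0.001.
self.policy_model.compile(optimizer=RMSprop(lr=1e-4),
                          loss='mean_squared_error')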
In any case, this could be one possible cause :)
Answer 1 (score: 0):
Pay attention to your epsilon decay. It sets the exploration/exploitation trade-off over time. If epsilon decays too fast, the agent starts exploiting a very small (under-explored) part of the state-action space far too early. In my experience, at least, early convergence to an incorrect policy is most often caused by an epsilon decay that is too aggressive.
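I don't know what your decay_epsilon looks like, but a common per-episode exponential schedule with your hyperparameters would be something like the sketch below; printing current_eps every few episodes is an easy way to see how quickly exploration actually drops off:

import math

def decay_epsilon(self, episode):
    # Exponential decay from eps_start toward eps_end. With eps_start = 1,
    # eps_end = 0.01 and eps_decay = 0.006 this only falls to about 0.5
    # around episode 115 and to about 0.06 around episode 500.
    self.current_eps = self.eps_end + \
        (self.eps_start - self.eps_end) * math.exp(-self.eps_decay * episode)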