I am currently trying to optimize the navigation of a robot. I first tuned the parameters with vanilla DQN. The simulated robot reached 8000 goals over the course of 5000 episodes and showed satisfying learning behaviour. Since DQN is "not the best" option in reinforcement learning, I then added Double DQN. Unfortunately, it performs very poorly under the same conditions. My first question is whether I have implemented DDQN correctly; my second question is how often the target network should be updated. Right now it is updated after every episode, and an episode can last up to 500 steps (if there is no crash). I could imagine updating the target more frequently (e.g. every 20 steps), but then I don't see how the target network would still be able to prevent the overestimation behaviour of the online network.
Here is the normal DQN training part:
def getQvalue(self, reward, next_target, done):
    # Standard DQN target: r for terminal transitions, otherwise r + gamma * max_a Q(s', a)
    if done:
        return reward
    else:
        return reward + self.discount_factor * np.amax(next_target)

def getAction(self, state):
    # Epsilon-greedy action selection on the online network
    if np.random.rand() <= self.epsilon:
        self.q_value = np.zeros(self.action_size)
        return random.randrange(self.action_size)
    else:
        q_value = self.model.predict(state.reshape(1, len(state)))
        self.q_value = q_value
        return np.argmax(q_value[0])

def trainModel(self, target=False):
    mini_batch = random.sample(self.memory, self.batch_size)
    X_batch = np.empty((0, self.state_size), dtype=np.float64)
    Y_batch = np.empty((0, self.action_size), dtype=np.float64)
    for i in range(self.batch_size):
        states = mini_batch[i][0]
        actions = mini_batch[i][1]
        rewards = mini_batch[i][2]
        next_states = mini_batch[i][3]
        dones = mini_batch[i][4]
        q_value = self.model.predict(states.reshape(1, len(states)))
        self.q_value = q_value
        # Bootstrap from the target network once it is in use, otherwise from the online network
        if target:
            next_target = self.target_model.predict(next_states.reshape(1, len(next_states)))
        else:
            next_target = self.model.predict(next_states.reshape(1, len(next_states)))
        next_q_value = self.getQvalue(rewards, next_target, dones)
        X_batch = np.append(X_batch, np.array([states.copy()]), axis=0)
        Y_sample = q_value.copy()
        Y_sample[0][actions] = next_q_value
        Y_batch = np.append(Y_batch, np.array([Y_sample[0]]), axis=0)
        if dones:
            X_batch = np.append(X_batch, np.array([next_states.copy()]), axis=0)
            Y_batch = np.append(Y_batch, np.array([[rewards] * self.action_size]), axis=0)
    self.model.fit(X_batch, Y_batch, batch_size=self.batch_size, epochs=1, verbose=0)
And here is the updated version for Double DQN:
def getQvalue(self, reward, next_target, next_q_value_1, done):
    # Double DQN target: the online network picks the greedy action,
    # the target network evaluates that action
    if done:
        return reward
    else:
        a = np.argmax(next_q_value_1[0])
        return reward + self.discount_factor * next_target[0][a]

def getAction(self, state):
    # Epsilon-greedy action selection on the online network
    if np.random.rand() <= self.epsilon:
        self.q_value = np.zeros(self.action_size)
        return random.randrange(self.action_size)
    else:
        q_value = self.model.predict(state.reshape(1, len(state)))
        self.q_value = q_value
        return np.argmax(q_value[0])

def trainModel(self, target=False):
    mini_batch = random.sample(self.memory, self.batch_size)
    X_batch = np.empty((0, self.state_size), dtype=np.float64)
    Y_batch = np.empty((0, self.action_size), dtype=np.float64)
    for i in range(self.batch_size):
        states = mini_batch[i][0]
        actions = mini_batch[i][1]
        rewards = mini_batch[i][2]
        next_states = mini_batch[i][3]
        dones = mini_batch[i][4]
        q_value = self.model.predict(states.reshape(1, len(states)))
        self.q_value = q_value
        # The online network always selects the action; the target network
        # evaluates it once it is in use, otherwise this falls back to plain DQN
        if target:
            next_q_value_1 = self.model.predict(next_states.reshape(1, len(next_states)))
            next_target = self.target_model.predict(next_states.reshape(1, len(next_states)))
        else:
            next_q_value_1 = self.model.predict(next_states.reshape(1, len(next_states)))
            next_target = self.model.predict(next_states.reshape(1, len(next_states)))
        next_q_value = self.getQvalue(rewards, next_target, next_q_value_1, dones)
        X_batch = np.append(X_batch, np.array([states.copy()]), axis=0)
        Y_sample = q_value.copy()
        Y_sample[0][actions] = next_q_value
        Y_batch = np.append(Y_batch, np.array([Y_sample[0]]), axis=0)
        if dones:
            X_batch = np.append(X_batch, np.array([next_states.copy()]), axis=0)
            Y_batch = np.append(Y_batch, np.array([[rewards] * self.action_size]), axis=0)
    self.model.fit(X_batch, Y_batch, batch_size=self.batch_size, epochs=1, verbose=0)
Basically, the change is in the getQvalue part: I select the action with the original (online) network and then take the value of that action from the target network. If target is True, the target network is only used after 2000 global steps; before that (roughly the first 10 episodes) its estimates are not yet reasonable. Best regards and thanks in advance!
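To make the intended logic explicit, the same Double DQN target computation can be written as a small vectorized sketch (the function double_dqn_targets and the arrays q_online_next / q_target_next below are just illustrative names, not part of my agent class):

import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma):
    # q_online_next / q_target_next: predictions for the next states from the
    # online and the target network, each of shape (batch_size, action_size)
    best_actions = np.argmax(q_online_next, axis=1)                        # online net selects
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]  # target net evaluates
    # Terminal transitions do not bootstrap
    return rewards + gamma * evaluated * (1.0 - dones.astype(np.float64))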
Answer 0 (score: 0):
You should not update the target network after every episode, because the target network was introduced precisely to stabilize the training of the Q-values. Depending on the environment, the update frequency should be somewhere around 100, 1000 or 10000 steps.
You can check this question, where I modified the code: Cartpole-v0 loss increasing using DQN
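As a rough sketch of such a step-based schedule (the attribute names global_step and target_update_freq are assumptions, adapt them to your agent class), a hard update of a Keras model could look like this:

def maybe_update_target(self):
    # Call this once per training step, not once per episode
    self.global_step += 1
    if self.global_step % self.target_update_freq == 0:  # e.g. every 1000 or 10000 steps
        self.target_model.set_weights(self.model.get_weights())

A soft (Polyak) update is an alternative: blend the weights a little on every step, e.g. new_target = tau * online + (1 - tau) * old_target with a small tau such as 0.01. Either way, the point is that the bootstrap targets stay fixed (or change only slowly) between updates.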