Question

I am having a hard time getting my Deep Q-Learning agent to find the optimal policy. This is what my current model looks like in TensorFlow:

model = Sequential()

model.add(Dense(units=32, activation="relu", input_dim=self.env.state.size)),
model.add(Dense(units=self.env.allActionsKeys.size, activation="softmax"))

model.compile(loss="mse", optimizer=Adam(lr=0.00075), metrics=['accuracy'])

For the problem I am currently working on, 'self.env.state.size' is equal to 6 and the number of possible actions ('self.env.allActionsKeys.size') is 30.

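The input vector consists of components ("bits"), each with its own range (although the ranges do not differ much in this problem). Two of them range over [0,3], another two over [0,2], and the rest over [0,1]. Note that this is supposed to be a simple problem; I am also aiming at more complex ones where, for example, the input size would be 15 and the ranges could differ a bit more ([0,15], [0,3], ...).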

This is what my train method looks like:

def train(self, terminal_state):
    if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
        return

    # Get MINIBATCH_SIZE random samples from replay_memory
    minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)

    # Transition: (current_state, action, reward, normalized_next_state, next_state, done)

    current_states = np.array([transition[0] for transition in minibatch])
    current_qs_minibatch = self.model.predict(current_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    next_states = np.array([transition[3] for transition in minibatch])
    next_qs_minibatch = self.model.predict(next_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    env_get_legal_actions = self.env.get_legal_actions
    np_max = np.max

    X = []
    y = []

    for index, (current_state, action, reward, normalized_next_state, next_state, done) in enumerate(minibatch):
        if not done:
            legalActionsIds = env_get_legal_actions(next_state)
            max_next_q = np_max(next_qs_minibatch[index][legalActionsIds])

            new_q = reward + DISCOUNT * max_next_q
        else:
            new_q = reward

        current_qs = current_qs_minibatch[index].copy()
        current_qs[action] = new_q

        X.append(current_state)
        y.append(current_qs)

    self.model.fit(np.array(X), np.array(y), batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False)

where DISCOUNT = 0.99 and MINIBATCH_SIZE = 64.
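For reference, the target computed inside the loop above is the usual one-step Q-learning target, with the maximum taken over legal next actions only. Below is a minimal standalone sketch using plain lists instead of NumPy arrays; the function name `td_target` and the sample Q-values are illustrative assumptions, not part of the original code.

```python
DISCOUNT = 0.99  # value given in the question


def td_target(reward, next_qs, legal_action_ids, done, discount=DISCOUNT):
    """One-step Q-learning target, maximising only over legal next actions."""
    if done:
        # Terminal transition: there is no future return to bootstrap from.
        return reward
    max_next_q = max(next_qs[i] for i in legal_action_ids)
    return reward + discount * max_next_q


# Hypothetical Q-values for four actions; only actions 0 and 2 are legal.
next_qs = [0.5, 9.0, 2.0, -1.0]
target = td_target(1.0, next_qs, legal_action_ids=[0, 2], done=False)
terminal_target = td_target(1.0, next_qs, legal_action_ids=[0, 2], done=True)
```

In this made-up case the illegal action 1 has the largest Q-value but is ignored, so the non-terminal target is 1.0 + 0.99 * 2.0 = 2.98. Targets of this kind are not confined to the interval [0, 1].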

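I read that normalizing the input vector is recommended, so I tested two different attribute normalization methods: min-max normalization and z-score normalization. Since the value ranges do not differ that much, I also tested without normalization. None of these approaches proved better than the others.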
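For readers unfamiliar with the two methods, here is a rough sketch of what they could look like for the 6-component state. The ordering of the ranges in `LOWS` and `HIGHS`, the function names, and the sample states are assumptions made for illustration; the question does not show `env_normalize`.

```python
from statistics import mean, pstdev

# Assumed per-component ranges, in an assumed order:
# two components in [0, 3], two in [0, 2], two in [0, 1].
LOWS = [0, 0, 0, 0, 0, 0]
HIGHS = [3, 3, 2, 2, 1, 1]


def min_max_normalize(state, lows=LOWS, highs=HIGHS):
    """Scale each component to [0, 1] using its known range."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(state, lows, highs)]


def z_score_normalize(states):
    """Standardise each component using statistics from a batch of states."""
    columns = list(zip(*states))
    means = [mean(c) for c in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(state, means, stds)]
            for state in states]


scaled = min_max_normalize([3, 0, 1, 2, 0, 1])
standardised = z_score_normalize([[3, 0, 1, 2, 0, 1], [1, 2, 1, 0, 1, 1]])
```

Min-max scaling needs only the known ranges, while z-score scaling needs statistics collected from visited states, so its result depends on which states have been sampled.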

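What happens is that at the beginning, during the exploration phase, the score improves over time, which suggests the model is learning something. But later, during the exploitation phase, when epsilon is low and the agent takes most of its actions greedily, the score drops dramatically, which means it did not actually learn anything good.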

This is my deep Q-learning algorithm:

epsilon = 1

for episode in range(1, EPISODES+1):
    episode_reward = 0
    step = 0
    done = False
    current_state = env_reset()

    while not done:
        normalized_current_state = env_normalize(current_state)

        if np_random_number() > epsilon:  # Take legal action greedily
            actionsQValues = agent_get_qs(normalized_current_state)
            legalActionsIds = env_get_legal_actions(current_state)
            # Make the argmax selection among the legal actions
            action = legalActionsIds[np_argmax(actionsQValues[legalActionsIds])]
        else:  # Take random legal action
            action = env_sample()

        new_state, reward, done = env_step(action)

        episode_reward += reward

        agent_update_replay_memory((normalized_current_state, action, reward, env_normalize(new_state), new_state, done))
        agent_train(done)

        current_state = new_state
        step += 1

    # Decay epsilon
    if epsilon > MIN_EPSILON:
        epsilon *= EPSILON_DECAY

where EPISODES = 4000 and EPSILON_DECAY = 0.9995.
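To make the schedule concrete, the sketch below replays only the epsilon decay from the loop above. `MIN_EPSILON = 0.001` is an assumed value, since the question does not state it; the other two constants are taken from the question.

```python
EPISODES = 4000          # from the question
EPSILON_DECAY = 0.9995   # from the question
MIN_EPSILON = 0.001      # assumed; not stated in the question

epsilon = 1.0
history = []
for episode in range(1, EPISODES + 1):
    history.append(epsilon)  # epsilon used during this episode
    # Same per-episode decay rule as in the training loop.
    if epsilon > MIN_EPSILON:
        epsilon *= EPSILON_DECAY

final_epsilon = epsilon
# First episode (1-based) in which epsilon is below 0.5,
# i.e. the agent acts greedily more often than randomly.
first_mostly_greedy = next(
    i for i, e in enumerate(history, start=1) if e < 0.5
)
```

Under this assumption epsilon ends near 0.9995^4000, which is roughly 0.135, so a noticeable share of actions is still random at the end of training, and the agent becomes mostly greedy a little under 1,400 episodes in.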

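I have experimented with all of these hyperparameters, but the results are very similar. I do not know what else to try. Am I doing something wrong with the normalization? Is there another normalization method that is more recommended? Could the problem be that my neural network model is not good enough?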

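I would not expect it to be hard to make this work for such a simple problem, with an input size of 6, an output layer of 30 nodes, and a hidden layer of 32 units.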

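Note that for the same problem I also use a different representation of the state, a binary array of size 14, and it works fine with the same hyperparameters. So what might be the problem when I use this other representation?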

Answer 1

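I found that there was a problem with the model implementation. The activation function of the output layer should not be softmax but linear. At least in my case, it worked much better that way.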
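One way to see why softmax is a poor fit here: the training targets are Q-values of the form reward + DISCOUNT * max_next_q, which can be any real number, whereas softmax outputs are confined to [0, 1] and must sum to 1. The sketch below uses only the standard library, and the logits are made-up numbers chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(logits):
    """Identity activation: outputs are the raw pre-activations."""
    return list(logits)


# Hypothetical pre-activations of the output layer for three actions.
logits = [2.98, -4.0, 10.0]

softmax_out = softmax(logits)  # every entry in [0, 1], entries sum to 1
linear_out = linear(logits)    # unconstrained, can match any real target
```

With a softmax output the network cannot reproduce a target such as 2.98 or -4.0 under an MSE loss, no matter how long it trains. In the Keras model from the question, the change described in this answer corresponds to passing activation="linear" to the final Dense layer.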

Why does the score (accumulated reward) go down during the exploitation phase in this Deep Q-Learning model?
