DQN Monopoly agent not learning

Date: 2020-05-14 12:10:33

Tags: python tensorflow keras neural-network tf.keras

I am trying to implement a DQN agent that can play Monopoly. At the moment I am training it against an agent that makes random but valid decisions. I have tried different network architectures, but none of the ones I have used so far seem to converge.

The agent's state is represented by 23 floats. The action (the desired output of the network) is expected to be between 0 and 30, which is the maximum number of actions the agent can take at any given time. I have also implemented a similar-state function that compares two states and declares them similar or dissimilar, in order to reduce the number of very similar states used for training.
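
The similarity check itself is roughly along these lines (a simplified sketch rather than my exact code; the helper name and the tolerance value are placeholders):

import numpy as np

def states_are_similar(state_a, state_b, tol=0.05):
    # Two 23-float state vectors count as "similar" when every
    # feature differs by less than the tolerance.
    diff = np.abs(np.asarray(state_a) - np.asarray(state_b))
    return bool(np.all(diff < tol))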

I have tried 2 different network architectures with different parameters, but once again, neither of them converged:

Model 1:

learning_rate: float = 0.005
gamma: float = 0.95 
exploration_rate: float = 1.0
exploration_min: float = 0.2
exploration_decay: float = 0.996

model = Sequential()
model.add(Dense(26, activation='relu', input_dim=23))
model.add(Dense(26, activation='relu'))
model.add(Dense(30, activation='softmax'))
model.compile(optimizer=Adam(lr=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])

Model 2:

learning_rate: float = 0.15
gamma: float = 0.95  # discount rate (gamma)
exploration_rate: float = 0.7
exploration_min: float = 0.2
exploration_decay: float = 0.996

model = Sequential()
model.add(Dense(40, activation='relu', input_dim=23))
model.add(Dense(40, activation='relu'))
model.add(Dense(30, activation='softmax'))
model.compile(optimizer=Adam(lr=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])

The following function shows how the training is done. I should note that an episode is cut off if 6 seconds have passed since it started (the AI has shown it can win a game within 4 seconds after playing roughly 3000 games); this is done to reduce training time.

def train(self):
    """
    This function is intended to train a deep neural network by playing the game over a range of x episodes.
    :return:
    """
    self.training_dqn = self.dqn()
    for index_episode in range(self.training_dqn.episodes):
        self.reset_game()
        # player of level 2 is the one being trained
        if self.winner is not None and self.winner.level == 2:
            self.reward = 10
        else:
            self.reward = -10
        remember(self.state, self.action, self.reward, self.next_state, self.done)
        replay(self.dqn.sample_batch_size)
        save_model()
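
remember is not shown above; it simply stores the transition in a bounded replay memory, roughly like this (simplified sketch, the maxlen is a placeholder):

from collections import deque

self.memory = deque(maxlen=2000)  # created once in the agent's __init__

def remember(self, state, action, reward, next_state, done):
    # Store one transition so replay() can sample it later.
    self.memory.append((state, action, reward, next_state, done))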

The following function performs the DQN learning:

def replay(self, sample_batch_size):
    """
    A sample size of events is taken from the memory to teach the neural network the optimal actions
    to take in a given state.
    :param sample_batch_size: int - size of batch to learn from (teach the deep neural network)
    """
    if sample_batch_size >= len(self.memory):
        sample_batch = self.memory
    else:
        sample_batch = random.sample(self.memory, sample_batch_size)
    for state, action, reward, next_state, done in sample_batch:
        target = reward
        if not done:
            # preparing input for NN.
            next_state = np.array(next_state)
            next_state = next_state.reshape(-1, 23)
            ############################################
            next_state_prediction = sum(self.ql_network.predict(next_state).tolist(), [])
            next_state_prediction = next_state_prediction.index(max(next_state_prediction))
            target = reward + self.gamma * next_state_prediction
        # preparing input for NN.
        state = np.array(state)
        state = state.reshape(-1, 23)
        ##################################
        target_f = self.ql_network.predict(state)
        target_f[0][action] = target
        target_f = sum(target_f.tolist(), [])
        target_f = target_f.index(max(target_f))
        self.ql_network.fit(state, to_categorical([target_f], num_classes=30), epochs=1, verbose=0)
    if self.exploration_rate > self.exploration_min:
        self.exploration_rate *= self.exploration_decay
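
For comparison, my understanding of the textbook DQN update is that the target is a Q-value (a number) regressed with an MSE loss, not a class index trained with cross-entropy. A minimal sketch of that version of the inner loop, assuming a hypothetical q_network with 30 linear output units compiled with an 'mse' loss (which is not what my models above use):

# one transition (state, action, reward, next_state, done)
state = np.array(state).reshape(-1, 23)
next_state = np.array(next_state).reshape(-1, 23)

target = reward
if not done:
    # bootstrap with the best predicted Q-value of the next state
    target = reward + self.gamma * np.max(self.q_network.predict(next_state)[0])

q_values = self.q_network.predict(state)   # shape (1, 30)
q_values[0][action] = target               # only the taken action's target changes
self.q_network.fit(state, q_values, epochs=1, verbose=0)

I am not sure whether this difference is what breaks convergence in my case.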

I am not really sure why it is not converging. Maybe I need a completely different network architecture, one better suited to a problem with this many possible states. Thanks in advance!

0 Answers:

No answers yet.