I am trying to implement a DQN agent that can play Monopoly. At the moment I am training it against an agent that takes random valid decisions. I have experimented with different network architectures, but none of the ones I have used so far seems to converge.
The agent's state is represented by 23 floating-point numbers. The action (the expected output of the network) is expected to be between 0 and 30, i.e. the maximum number of actions the agent can take at a given time. I have also implemented a similar-state function that compares two states and declares them similar or dissimilar, in order to reduce the number of very similar states used for training.
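The similar-state check is essentially a tolerance-based comparison of the two 23-float state vectors; a minimal sketch of the idea (the function name and tolerance value here are illustrative, not my exact code):

import numpy as np

def states_similar(state_a, state_b, tolerance=0.05):
    """Treat two 23-float state vectors as similar when every
    component differs by less than the given tolerance."""
    a = np.asarray(state_a, dtype=np.float32)
    b = np.asarray(state_b, dtype=np.float32)
    return bool(np.all(np.abs(a - b) < tolerance))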
I have tried two different network architectures with different parameters, but once again, neither of them converges:
Model 1:
learning_rate: float = 0.005
gamma: float = 0.95
exploration_rate: float = 1.0
exploration_min: float = 0.2
exploration_decay: float = 0.996
model = Sequential()
model.add(Dense(26, activation='relu', input_dim=23))
model.add(Dense(26, activation='relu'))
model.add(Dense(30, activation='softmax'))
model.compile(optimizer=Adam(lr=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])
Model 2:
learning_rate: float = 0.15
gamma: float = 0.95  # discount rate (gamma)
exploration_rate: float = 0.7
exploration_min: float = 0.2
exploration_decay: float = 0.996
model = Sequential()
model.add(Dense(40, activation='relu', input_dim=23))
model.add(Dense(40, activation='relu'))
model.add(Dense(30, activation='softmax'))
model.compile(optimizer=Adam(lr=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])
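For context, action selection uses exploration_rate for epsilon-greedy exploration over the network's 30 outputs; a minimal sketch of the kind of act() method used (valid_actions and the method name are illustrative, not my exact code):

import random
import numpy as np

def act(self, state, valid_actions):
    """Epsilon-greedy action selection: explore with probability
    exploration_rate, otherwise pick the valid action with the
    highest predicted value."""
    if np.random.rand() < self.exploration_rate:
        return random.choice(valid_actions)  # explore
    q_values = self.ql_network.predict(np.array(state).reshape(-1, 23))[0]
    return max(valid_actions, key=lambda a: q_values[a])  # exploit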
The following function shows how the training is done. I should note that an episode ends if 6 seconds have passed since the start of the episode (given that, after playing roughly 3000 games, the AI has proven able to win a game within about 4 seconds). This is done to reduce training time.
def train(self):
    """
    This function is intended to train a deep neural network by playing the game over a range of x episodes.
    :return:
    """
    self.training_dqn = self.dqn()
    for index_episode in range(self.training_dqn.episodes):
        self.reset_game()
        # the player of level 2 is the one being trained
        if self.winner is not None and self.winner.level == 2:
            self.reward = 10
        else:
            self.reward = -10
        self.training_dqn.remember(self.state, self.action, self.reward, self.next_state, self.done)
        self.training_dqn.replay(self.training_dqn.sample_batch_size)
        self.training_dqn.save_model()
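The 6-second episode limit mentioned above is enforced inside the game loop rather than in train(); roughly, the idea is the following (play_episode and play_next_turn are illustrative names, not my actual code):

import time

def play_episode(self, time_limit=6.0):
    """Play one episode, cutting it off once the wall-clock limit is hit."""
    start = time.time()
    while not self.done:
        if time.time() - start > time_limit:
            break  # the episode ends after 6 seconds of play
        self.play_next_turn()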
The following function performs the DQN learning:
def replay(self, sample_batch_size):
    """
    A sample size of events is taken from the memory to teach the neural network the optimal actions
    to take in a given state.
    :param sample_batch_size: int - size of batch to learn from (teach the deep neural network)
    """
    if sample_batch_size >= len(self.memory):
        sample_batch = self.memory
    else:
        sample_batch = random.sample(self.memory, sample_batch_size)
    for state, action, reward, next_state, done in sample_batch:
        target = reward
        if not done:
            # preparing input for NN.
            next_state = np.array(next_state)
            next_state = next_state.reshape(-1, 23)
            ############################################
            next_state_prediction = sum(self.ql_network.predict(next_state).tolist(), [])
            next_state_prediction = next_state_prediction.index(max(next_state_prediction))
            target = reward + self.gamma * next_state_prediction
        # preparing input for NN.
        state = np.array(state)
        state = state.reshape(-1, 23)
        ##################################
        target_f = self.ql_network.predict(state)
        target_f[0][action] = target
        target_f = sum(target_f.tolist(), [])
        target_f = target_f.index(max(target_f))
        self.ql_network.fit(state, to_categorical([target_f], num_classes=30), epochs=1, verbose=0)
    if self.exploration_rate > self.exploration_min:
        self.exploration_rate *= self.exploration_decay
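For reference, the target I am trying to learn is the standard DQN target, where the network output is treated as the Q-value of each of the 30 actions:

target = reward + gamma * max_a' Q(next_state, a')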
I am not really sure why it is not converging. Maybe I need to use a completely different network architecture, one better suited to a problem with this many possible states. Thanks in advance!