Why can't my Deep Q Network learn a simple game?

Asked: 2020-04-20 19:19:01

Tags: python machine-learning deep-learning tensorflow2.0 reinforcement-learning

So I made a small Python game where the player has to reach the end tile while avoiding traps, and it looks like this:

[screenshot of the game grid]

I have tried many different batch sizes, rewards, input shapes, and numbers of nodes in the hidden layers, but the network still doesn't learn.

My current training setup uses a batch size of 64 and a replay memory of size 100000. The input is a 1D array holding the flattened game state, the player's coordinates, and the number of moves left before the game ends. The reward starts from -distanceFromEnd + maxDistance / 2; reaching the end gives a +500 reward and ends the game, touching a trap gives a -100 reward and ends the game, and if the game is not finished within 64 moves the reward is -200 and the game ends.
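In pseudocode, the per-step reward scheme looks roughly like this (a simplified sketch of the step() method shown further below; the function name and arguments are illustrative only):

    import math

    def compute_reward(player, end, reached_end, hit_trap, out_of_moves, max_distance):
        # shaped base reward: the closer to the end tile, the higher the reward
        distance = math.sqrt((player[0] - end[0]) ** 2 + (player[1] - end[1]) ** 2)
        reward = -distance + max_distance / 2
        done = False
        if reached_end:
            reward, done = 500, True    # reached the end tile
        elif hit_trap:
            reward, done = -100, True   # stepped on a trap
        elif out_of_moves:
            reward, done = -200, True   # ran out of the 64-move budget
        return reward, done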

I am using the Adam optimizer with an MSE loss, and for the activations I use ReLU on every layer except the last one.

After each episode, the positions of the player, the end tile, and the traps are randomized.

Even after 3000 episodes, the average score over the last 100 games (the score is simply the sum of rewards) is around -30.
The same DQN works fine on the Gym game LunarLander-v2.
As I said, I have kept tweaking the values, but it hasn't helped.

First, here are the labels I use in the state:

  FLOOR = 1
  END = 2
  TRAP = 3
  PLAYER = 4

Here is my step function:

def step(self, action):
    isDone = False
    if action == 0:
        # Move Up
        if self.playerY != 0:
            self.playerY -= 1
    elif action == 1:
        # Move Down
        if self.playerY != 7:
            self.playerY += 1
    elif action == 2:
        # Move Right
        if self.playerX != 0:
            self.playerX -= 1
    elif action == 3:
        # Move Left
        if self.playerX != 7:
            self.playerX += 1

    x = self.playerX - self.endX
    x = x * x
    y = self.playerY - self.endY
    y = y * y

    distance = math.sqrt(x + y)
    reward = -distance + self.maxDist
    #self.lastDist = distance

    if self.state[self.playerX, self.playerY] == self.END:
        reward = 500
        isDone = True
    elif self.state[self.playerX, self.playerY] == self.TRAP:
        reward = -100
        isDone = True

    self.moves -= 1

    if self.moves < 0:
        reward = -200
        isDone = True

    return self.getFlatState(), reward, isDone, 0

The state getter function:

  # Adding one to the player's coordinates to avoid 0s, as an attempt to fix the problem
  def getFlatState(self):
      return np.concatenate([np.ndarray.flatten(self.state), [self.playerX + 1, self.playerY + 1, self.moves]])

And here is the DQN / Agent script:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model

class ReplayBuffer():
    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0

        self.state_memory = np.zeros((self.mem_size, *input_dims),
                                     dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_dims),
                                         dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.int32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = 1 - int(done)
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)

        states = self.state_memory[batch]
        states_ = self.new_state_memory[batch]
        rewards = self.reward_memory[batch]
        actions = self.action_memory[batch]
        terminal = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminal

def build_dqn(lr, n_actions, input_dims, fc1_dims, fc2_dims):
    model = keras.Sequential([
        keras.layers.Dense(fc1_dims, activation='relu'),
        keras.layers.Dense(fc2_dims, activation='relu'),
        keras.layers.Dense(n_actions, activation=None)])
    model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')

    return model

class Agent():
    def __init__(self, lr, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec=1e-3, epsilon_end=0.01,
                 mem_size=1000000, fname='dqn_model.h5'):
        self.action_space = [i for i in range(n_actions)]
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = epsilon_end
        self.batch_size = batch_size
        self.model_file = fname
        self.memory = ReplayBuffer(mem_size, input_dims)
        self.q_eval = build_dqn(lr, n_actions, input_dims, 256, 128)

    def store_transition(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, observation):
        # epsilon-greedy action selection
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
            actions = self.q_eval.predict(state)

            action = np.argmax(actions)

        return action

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return

        states, actions, rewards, states_, dones = \
                self.memory.sample_buffer(self.batch_size)

        q_eval = self.q_eval.predict(states)
        q_next = self.q_eval.predict(states_)

        q_target = np.copy(q_eval)
        batch_index = np.arange(self.batch_size, dtype=np.int32)

        # Q-learning target: r + gamma * max_a' Q(s', a'); dones holds 1 - done,
        # so the bootstrap term is zeroed out for terminal transitions
        q_target[batch_index, actions] = rewards + \
                        self.gamma * np.max(q_next, axis=1) * dones

        self.q_eval.train_on_batch(states, q_target)

        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > \
                self.eps_min else self.eps_min

    def save_model(self):
        self.q_eval.save(self.model_file)

    def load_model(self):
        self.q_eval = load_model(self.model_file)
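
For context, the agent is trained with the usual episode loop, roughly like this (a simplified sketch, not the exact script: Game, reset(), and the hyperparameter values are placeholders, only the batch size of 64, the memory size of 100000, and the 3000 episodes come from the question, and input_dims=(67,) assumes the 8x8 grid plus the three extra values):

    # Sketch of the training loop; Game and the hyperparameters are illustrative
    env = Game()
    agent = Agent(lr=0.001, gamma=0.99, n_actions=4, epsilon=1.0,
                  batch_size=64, input_dims=(67,), mem_size=100000)

    scores = []
    for episode in range(3000):
        observation = env.reset()
        done = False
        score = 0
        while not done:
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            agent.store_transition(observation, action, reward, observation_, done)
            agent.learn()
            score += reward
            observation = observation_
        scores.append(score)
        print('episode', episode, 'score', score, 'avg last 100', np.mean(scores[-100:]))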

1 Answer:

Answer 0 (score: 0)

The problem was that the goal position and the agent's starting position were not stable. As the OP reported, once these issues were fixed the agent started winning consistently, "about 90% of the time".

That is still far from perfect, but I wouldn't expect much more from a naive DQN. Using more advanced techniques such as A3C or even DDQN (Double Deep Q-Learning) should help you solve the problem; as the problems we tackle get more complex, we move on to more advanced techniques.
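For reference, the core change in Double DQN is only in how the learning target is built: the online network picks the next action, and a separate target network evaluates it. A minimal sketch, assuming a second network q_target_net that is a periodically synced copy of q_eval (the name and the syncing schedule are up to you):

    # Double DQN target (sketch); q_target_net is an assumed copy of q_eval
    q_pred = q_eval.predict(states)
    q_next_online = q_eval.predict(states_)        # used only to pick the action
    q_next_target = q_target_net.predict(states_)  # used to value that action

    max_actions = np.argmax(q_next_online, axis=1)
    batch_index = np.arange(batch_size, dtype=np.int32)

    q_target = np.copy(q_pred)
    # dones stores 1 - done, exactly as in the ReplayBuffer above
    q_target[batch_index, actions] = rewards + \
            gamma * q_next_target[batch_index, max_actions] * dones
    q_eval.train_on_batch(states, q_target)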

Small, easy tasks that don't require much planning ahead can be handled with a range of other methods (e.g. Monte Carlo). But the main problem here is that your obstacles spawn randomly, and a plain DQN does not plan a path ahead of time to avoid the negative rewards from the red areas.
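To make the Monte Carlo comparison concrete: a Monte Carlo update waits until the episode finishes and then backs up the actual return, with no bootstrapping from the next state's estimate. A tabular sketch (Q, alpha, and the episode list are illustrative):

    # Monte Carlo-style update from one finished episode (sketch)
    # episode = [(state, action, reward), ...] collected in order
    G = 0
    for s, a, r in reversed(episode):
        G = r + gamma * G                  # return from this step to the end
        Q[s, a] += alpha * (G - Q[s, a])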

A DQN is essentially Q-learning, but the values are stored in a more compressed way (a neural network instead of a table) so that far more states can be handled. That is why it is unreliable for a setup as complex as this one (as mentioned above). So, simply put, the solution is to use more complex and newer methods, several of which I have already mentioned.
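For comparison, this is the tabular Q-learning update that the DQN's learn() method approximates with a network (Q, alpha, and the transition variables here are illustrative):

    # Tabular Q-learning update for one transition (s, a, r, s_)
    target = r + gamma * np.max(Q[s_]) * (1 - done)
    Q[s, a] += alpha * (target - Q[s, a])
    # A DQN fits its network to the same target instead of writing into a
    # table, which is what train_on_batch does in learn() above.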