I'm trying to build my own environment for Q-learning (for learning purposes) and train it with a simple neural network with a linear output activation. The problem is that it does not seem to learn this simple game, in which the player has to reach a goal without touching the enemy. Even after 2000 episodes the average reward stays in the same range. I would be grateful if anyone can spot the problem in my code snippets below.
# model
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(4,)))  # input: 1x4 delta-coordinate vector
model.add(Dense(4, activation='relu'))
model.add(Dense(4, activation='linear'))                  # output: one Q-value per action
model.compile(loss='mse', optimizer='adam')
###
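Just to show how the model is used: obs is a numpy array of shape (1, 4), and model.predict returns a (1, 4) array with one Q-value per action (the numbers below are only an illustration, not a real game state):

# illustration only; obs_example is a made-up state
obs_example = np.array([[2., -1., 3., 3.]])   # shape (1, 4)
q_values = model.predict(obs_example)         # shape (1, 4), one Q-value per action
action = np.argmax(q_values)                  # index 0..3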
# parameters
MOVE_REWARD = -1
ENEMY_REWARD = -100
GOAL_REWARD = 100
epsilon = 0.5
EPS_DECAY = 0.9999
DISCOUNT = 0.9
LEARNING = 0.8  # learning rate used in the Q-value update (not the optimizer's learning rate)
###
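epsilon is decayed over time with EPS_DECAY; this happens once per episode in code I left out of the snippet, roughly like this:

# assumed placement: applied after each episode, in the omitted code
epsilon *= EPS_DECAY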
# the learning loop; this whole block runs once per episode inside an outer loop
# over the number of episodes (after creating new positions for player, goal and enemy)
for i in range(200):
    [some code]
    # create delta coordinates of the current game state (obs):
    # a 1x4 vector of +/- values, player-to-goal in x,y and player-to-enemy in x,y
    # (see the sketch after the loop)
    if np.random.random() > epsilon:
        action = np.argmax(model.predict(obs))
    else:
        action = np.random.randint(0, 4)
    player.move(action)  # player moves in one direction
    if player == enemy:
        reward = ENEMY_REWARD
        _end_ = True
    elif player == goal:
        reward = GOAL_REWARD
        _end_ = True
    else:
        reward = MOVE_REWARD
    [...]
    # create delta coordinates of the new game state (new_obs),
    # same 1x4 vector layout as obs
    max_future_q = np.max(model.predict(new_obs))
    current_q = model.predict(obs)[0][action]
    if reward == GOAL_REWARD:
        new_q = GOAL_REWARD
    else:
        new_q = (1 - LEARNING) * current_q + LEARNING * (reward + DISCOUNT * max_future_q)
    target_vec = model.predict(obs)[0]
    target_vec[action] = new_q
    target_vec = target_vec.reshape(1, 4)
    model.fit(obs, target_vec, verbose=0, epochs=1)
    if _end_ == True:
        break
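To make the observation format concrete, obs and new_obs are built roughly like this (the attribute names are only placeholders for how the positions are stored, not my exact code):

# illustrative sketch only; player.x / goal.x / enemy.x are placeholder names
obs = np.array([[goal.x - player.x, goal.y - player.y,
                 enemy.x - player.x, enemy.y - player.y]], dtype=np.float32)  # shape (1, 4)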
I also tried rewriting it with a convolutional network, but I ran into the same problem. It would be great if someone could spot the mistake in my algorithm!