I'm trying to build my own environment for Q-learning (for learning purposes) and train it with a simple neural network with a linear output activation. The problem is that it does not seem to learn this simple game, in which the player has to reach a goal without touching the enemy. Even after 2000 episodes the average reward stays in the same range. I would be grateful if anyone can spot the problem in my code snippets below.
# model
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(4,)))  # input: 1x4 delta-coordinate vector
model.add(Dense(4, activation='relu'))
model.add(Dense(4, activation='linear'))                  # output: one Q-value per action
model.compile(loss='mse', optimizer='adam')
###
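Just to show how the model is used: obs is a numpy array of shape (1, 4), and model.predict returns a (1, 4) array with one Q-value per action (the numbers below are only an illustration, not a real game state):

# illustration only; obs_example is a made-up state
obs_example = np.array([[2., -1., 3., 3.]])   # shape (1, 4)
q_values = model.predict(obs_example)         # shape (1, 4), one Q-value per action
action = np.argmax(q_values)                  # index 0..3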
# parameters
MOVE_REWARD = -1
ENEMY_REWARD = -100
GOAL_REWARD = 100
epsilon = 0.5
EPS_DECAY = 0.9999
DISCOUNT = 0.9
LEARNING = 0.8  # learning rate used in the Q-value update (not the optimizer's learning rate)
###
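epsilon is decayed over time with EPS_DECAY; this happens once per episode in code I left out of the snippet, roughly like this:

# assumed placement: applied after each episode, in the omitted code
epsilon *= EPS_DECAY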
# the learning loop; this whole block runs once per episode inside an outer loop
# over the number of episodes (after creating new positions for player, goal and enemy)
for i in range(200):
    [some code]
    # create delta coordinates of the current game state (obs):
    # a 1x4 vector of +/- values, player-to-goal in x,y and player-to-enemy in x,y
    # (see the sketch after the loop)
    if np.random.random() > epsilon:
        action = np.argmax(model.predict(obs))
    else:
        action = np.random.randint(0, 4)
    player.move(action)  # player moves in one direction
    if player == enemy:
        reward = ENEMY_REWARD
        _end_ = True
    elif player == goal:
        reward = GOAL_REWARD
        _end_ = True
    else:
        reward = MOVE_REWARD
    [...]
    # create delta coordinates of the new game state (new_obs),
    # same 1x4 vector layout as obs
    max_future_q = np.max(model.predict(new_obs))
    current_q = model.predict(obs)[0][action]
    if reward == GOAL_REWARD:
        new_q = GOAL_REWARD
    else:
        new_q = (1 - LEARNING) * current_q + LEARNING * (reward + DISCOUNT * max_future_q)
    target_vec = model.predict(obs)[0]
    target_vec[action] = new_q
    target_vec = target_vec.reshape(1, 4)
    model.fit(obs, target_vec, verbose=0, epochs=1)
    if _end_ == True:
        break
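To make the observation format concrete, obs and new_obs are built roughly like this (the attribute names are only placeholders for how the positions are stored, not my exact code):

# illustrative sketch only; player.x / goal.x / enemy.x are placeholder names
obs = np.array([[goal.x - player.x, goal.y - player.y,
                 enemy.x - player.x, enemy.y - player.y]], dtype=np.float32)  # shape (1, 4)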
I also tried rewriting it with a convolutional network, but I ran into the same problem. It would be great if someone could spot the mistake in my algorithm!