I am trying to solve Tic-Tac-Toe with Q-learning. I use two Q-tables, one for player X and one for player O, and I reward a win with 1, a loss with -1, and a draw with 0.5.
Under perfect play every game ends in a draw, so I measure the draw ratio over the last 100 games. Unfortunately, I cannot get it to converge. My (relevant) code is below:
def update_qtable(state, next_state, action, player, qtable, discount, lr):
    state_idx = all_possible_states.index(state)
    # For a full board there are no more actions allowed (select_action would raise an error)
    if next_state.count(' ') > 0:
        next_action = select_action(next_state, player, eps=0)  # epsilon = 0, i.e. greedy
    else:
        next_action = 0
    # get the reward
    reward = 0
    winner, game_status = check_result(next_state)
    if game_status == 'Done' and winner == player:
        reward = 1
    if game_status == 'Done' and winner != player:
        reward = -1
    if game_status == 'Draw':
        reward = 0.5
    next_state_idx = all_possible_states.index(next_state)
    # calculate the long-term reward
    observed = reward + discount * qtable[next_state_idx, next_action]
    # update the Q-table
    qtable[state_idx, action] = qtable[state_idx, action] * (1 - lr) + observed * lr
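For concreteness, the last line is the standard tabular TD update Q(s,a) ← (1 − lr)·Q(s,a) + lr·(r + γ·Q(s',a')). A minimal standalone check of that arithmetic, with arbitrarily chosen values:

```python
# Standalone check of the update rule used above (example values chosen arbitrarily).
q_sa = 0.2        # current Q(s, a)
q_next = 0.6      # Q(s', a') for the greedy next action
reward = 0.0      # no reward on a non-terminal move
discount = 0.95
lr = 0.1

observed = reward + discount * q_next    # bootstrapped target: 0.57
q_sa = q_sa * (1 - lr) + observed * lr   # 0.2 * 0.9 + 0.57 * 0.1 ≈ 0.237
print(q_sa)
```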
...
# for each training epoch:
while game_status == 'Not Done':
    if players_turn == 0:  # X's move
        # print("\nAI X's turn!")
        action = select_action(state, 'X', epsilon)
        new_state = play_move(state, 'X', action)
    else:  # O's move
        # print("\nAI O's turn!")
        action = select_action(state, 'O', epsilon)
        new_state = play_move(state, 'O', action)
    # Q-values of terminal states (full board) shall not be updated
    if state.count(' ') > 0:
        update_qtable(state, new_state, action, 'X', qtable_X, discount=0.95, lr=learning_rate[epoch])
        update_qtable(state, new_state, action, 'O', qtable_O, discount=0.95, lr=learning_rate[epoch])
    state = new_state.copy()
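The loop above indexes `learning_rate[epoch]`, so I precompute one value per epoch. As a sketch of how such schedules can be built (the constants here are placeholders, not my actual values):

```python
import numpy as np

n_epochs = 5000

# Hypothetical decay schedules (placeholder constants):
# epsilon decays from 1.0 toward a floor of 0.01,
# the learning rate decays from 0.5 toward a floor of 0.001.
epsilon_schedule = np.maximum(0.01, 1.0 * 0.999 ** np.arange(n_epochs))
learning_rate = np.maximum(0.001, 0.5 * 0.999 ** np.arange(n_epochs))

# Inside the training loop:
#   epsilon = epsilon_schedule[epoch]
#   lr = learning_rate[epoch]
```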
I train for about 5000 epochs, decaying both the learning rate and epsilon (epsilon-greedy exploration). Once epsilon becomes small, the agents keep playing the same sequence of moves over and over, which ends with the first player winning. Is the problem that in Q-learning the reward is only given at the end of the game, so it never gets propagated back to the early moves?
I hope you can work with my code, but I am not sure how much information you need. Thanks!
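On that last point: the bootstrapped target r + γ·Q(s',a') does carry an end-of-game reward back toward earlier moves, but only one step per episode, so it needs repeated visits to the same states. A minimal sketch on a 3-step chain where only the final transition pays a reward of 1 (toy example, not my game code):

```python
# Three states in a row, one action each; only the last transition is rewarded.
# Repeated sweeps move the terminal reward backward one state per sweep.
discount, lr = 0.95, 0.5
q = [0.0, 0.0, 0.0]          # Q-value of the single action in states 0, 1, 2
rewards = [0.0, 0.0, 1.0]    # reward received after acting in each state

for episode in range(50):
    for s in range(3):
        target = rewards[s] + (discount * q[s + 1] if s < 2 else 0.0)
        q[s] = (1 - lr) * q[s] + lr * target

print(q)  # converges toward [0.95**2, 0.95, 1.0]
```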