Q-learning for Tic Tac Toe does not converge

Time: 2020-04-12 19:18:49

Tags: python deep-learning reinforcement-learning tic-tac-toe q-learning

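I am trying to solve tic-tac-toe with Q-learning. I use two Q-tables, one for player X and one for player O, and I reward a win with 1, a loss with -1, and a draw with 0.5.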

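With perfect play every game ends in a draw, so I measure the ratio of draws over the last 100 games. Unfortunately, I cannot get it to converge. My (relevant) code is as follows: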

def update_qtable(state, next_state, action, player, qtable, discount, lr):
    state_idx = all_possible_states.index(state)
    # For a full board, there are no more actions allowed (select_actions produces error)
    if next_state.count(' ') > 0:
        next_action = select_action(next_state, player, eps=0)  # 0 = epsilon (greedy)
    else:
        next_action = 0
    # Get Reward
    reward = 0
    winner, game_status = check_result(next_state)
    if game_status == 'Done' and winner == player:
        reward = 1
    if game_status == 'Done' and winner != player:
        reward = -1
    if game_status == 'Draw':
        reward = 0.5
    next_state_idx = all_possible_states.index(next_state)
    # calculate long-term reward
    observed = reward + discount * qtable[next_state_idx, next_action]
    # update Q-Table
    qtable[state_idx, action] = qtable[state_idx, action] * (1 - lr) + observed * lr
     .
     .
     .

#for each training epoch:
while game_status == 'Not Done':
    if players_turn == 0:  # X's move
        # print("\nAI X's turn!")
        action = select_action(state, 'X', epsilon)
        new_state = play_move(state, 'X', action)
    else:  # O's move
        # print("\nAI O's turn!")
        action = select_action(state, 'O', epsilon)
        new_state = play_move(state, 'O', action)

    # Q Values of terminal states (full board) shall not be updated
    if state.count(' ') > 0:
        update_qtable(state, new_state, action, 'X', qtable_X, discount=0.95, lr=learning_rate[epoch])
        update_qtable(state, new_state, action, 'O', qtable_O, discount=0.95, lr=learning_rate[epoch])

    state = new_state.copy()
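
The draw ratio over the last 100 games mentioned above could be tracked roughly as follows. This is only a sketch: `DrawRateTracker` is a hypothetical helper that does not appear in the code above, and it assumes that the game status at the end of a game is either 'Draw' or 'Done', as in `update_qtable`.

```python
from collections import deque


class DrawRateTracker:
    """Hypothetical helper: fraction of draws among the most recent games."""

    def __init__(self, window=100):
        # A deque with maxlen drops the oldest result automatically
        self.results = deque(maxlen=window)

    def record(self, game_status):
        # Assumption: game_status is 'Draw' or 'Done' once a game has ended
        self.results.append(game_status == 'Draw')

    def ratio(self):
        # Returns 0.0 before any game has been recorded
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

Calling `record(game_status)` once per finished game and reading `ratio()` after each epoch would give the convergence measure described above.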

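I trained for about 5000 epochs with a decaying learning rate and a decaying epsilon (epsilon-greedy). Once epsilon becomes small, the same sequence of moves is played over and over, and it ends with the first player winning. Is this a problem with Q-learning, in that the reward is only given at the end of the game and therefore never propagates back to the early moves?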
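
On the question of whether an end-of-game reward can reach early moves at all, a toy sanity check may help. The sketch below is not the tic-tac-toe code. It assumes a hypothetical chain of states with a single action each, where only the final transition is rewarded, and it applies the same update form as `update_qtable` (`q * (1 - lr) + observed * lr`). The function name `train_chain` and its parameters are made up for illustration.

```python
import numpy as np


def train_chain(n_states=5, episodes=200, discount=0.95, lr=0.5):
    # Toy chain: one action per state, moving from state i to state i + 1.
    # Only the last transition is rewarded, like a game decided on its final move.
    q = np.zeros(n_states + 1)  # q[n_states] is the terminal state and stays 0
    for _ in range(episodes):
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            observed = reward + discount * q[s + 1]
            q[s] = q[s] * (1 - lr) + observed * lr
    return q[:n_states]


if __name__ == "__main__":
    # Each value should approach discount ** (steps remaining after this move)
    print(train_chain())
```

In this toy setting the value of the first state approaches `discount ** 4`, so a terminal-only reward does reach early states after enough episodes. If that carries over to tic-tac-toe, the cause of the non-convergence may lie elsewhere, for example in which state and player the bootstrapped `next_action` is chosen for, but that is a guess rather than a confirmed diagnosis.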

I hope my code is readable; I am not sure how much information you need. Thanks!

0 answers:

No answers yet.