Question

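I am trying to solve Tic Tac Toe with Q-learning. I use two Q-tables, one for player X and one for player O, and I reward a win with 1, a loss with -1, and a draw with 0.5.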

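Since every game ends in a draw when both sides play optimally, I measure the ratio of draws over the last 100 games. Unfortunately, I cannot get it to converge. My (relevant) code is below: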

def update_qtable(state, next_state, action, player, qtable, discount, lr):
    state_idx = all_possible_states.index(state)
    # For a full board, there are no more actions allowed (select_actions produces error)
    if next_state.count(' ') > 0:
        next_action = select_action(next_state, player, eps=0)  # 0 = epsilon (greedy)
    else:
        next_action = 0
    # Get Reward
    reward = 0
    winner, game_status = check_result(next_state)
    if game_status == 'Done' and winner == player:
        reward = 1
    if game_status == 'Done' and winner != player:
        reward = -1
    if game_status == 'Draw':
        reward = 0.5
    next_state_idx = all_possible_states.index(next_state)
    # calculate long-term reward
    observed = reward + discount * qtable[next_state_idx, next_action]
    # update Q-Table
    qtable[state_idx, action] = qtable[state_idx, action] * (1 - lr) + observed * lr
     .
     .
     .

#for each training epoch:
while game_status == 'Not Done':
    if players_turn == 0:  # X's move
        # print("\nAI X's turn!")
        action = select_action(state, 'X', epsilon)
        new_state = play_move(state, 'X', action)
    else:  # O's move
        # print("\nAI O's turn!")
        action = select_action(state, 'O', epsilon)
        new_state = play_move(state, 'O', action)

    # Q Values of terminal states (full board) shall not be updated
    if state.count(' ') > 0:
        update_qtable(state, new_state, action, 'X', qtable_X, discount=0.95, lr=learning_rate[epoch])
        update_qtable(state, new_state, action, 'O', qtable_O, discount=0.95, lr=learning_rate[epoch])

    state = new_state.copy()
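
The draw ratio over the last 100 games, used above as the convergence metric, could be tracked roughly as follows. This is only a minimal sketch: `DrawRatioTracker` is a hypothetical helper that is not part of the code above, and it assumes that `check_result` reports a drawn game with the status string 'Draw', as in `update_qtable`.

```python
from collections import deque


class DrawRatioTracker:
    """Hypothetical helper: fraction of draws among the most recent games."""

    def __init__(self, window=100):
        # A bounded deque drops the oldest result once the window is full.
        self.results = deque(maxlen=window)

    def record(self, game_status):
        # Store True for a draw and False for any decisive game.
        self.results.append(game_status == 'Draw')

    def ratio(self):
        # Return 0.0 before any game has been recorded.
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

Calling `record(game_status)` once at the end of each game and reading `ratio()` every few epochs would give the curve that is expected to approach 1.0 if both agents learn to play optimally.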

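I train for about 5000 epochs with a decaying learning rate and a decaying epsilon (epsilon-greedy). Once epsilon gets small, the agents keep playing the same sequence of moves, which ends in a win for the first player. Is this a problem with Q-learning itself, in that the reward is only given at the end of the game and therefore never gets propagated back to the early moves?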
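
For context, per-epoch schedules for the learning rate and epsilon might be built as in the sketch below. The post does not say which decay form or which start and end values are used, so the geometric decay, the numbers 0.5, 1.0 and 0.01, and the names `decay_schedule` and `epsilon_schedule` are all assumptions made for illustration. Only the indexing `learning_rate[epoch]` comes from the code above.

```python
def decay_schedule(start, end, n_epochs):
    """Return n_epochs values decaying geometrically from start to end.

    Assumes start > 0 and end > 0 (hypothetical helper, not from the post).
    """
    if n_epochs == 1:
        return [start]
    # Constant factor applied at every epoch so that the last value equals end.
    ratio = (end / start) ** (1.0 / (n_epochs - 1))
    return [start * ratio ** i for i in range(n_epochs)]


n_epochs = 5000  # "about 5000 epochs" in the post
learning_rate = decay_schedule(0.5, 0.01, n_epochs)     # assumed values
epsilon_schedule = decay_schedule(1.0, 0.01, n_epochs)  # assumed values
```

With schedules like these, the training loop would read `epsilon = epsilon_schedule[epoch]` at the start of each epoch. A slower decay keeps exploration alive for longer, which matters when the greedy policy has locked onto a single line of play.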

I hope my code is readable; I am not sure how much more information you need. Thank you!

Q-learning for Tic Tac Toe does not converge

0 answers: