Reinforcement learning - TD learning from afterstates

Date: 2015-07-05 04:54:46

Tags: machine-learning reinforcement-learning temporal-difference

I am writing a program that teaches 2 players to play a simple board game using reinforcement learning and the afterstate-based temporal difference learning method TD(λ). Learning is achieved by training a neural network; I use Sutton's NonLinear TD/Backprop neural network. I would really like your opinion on the following dilemma. The basic algorithm/pseudocode for conducting turns between the two opponents is this:

WHITE.CHOOSE_ACTION(GAME_STATE); //White player decides on its next move by evaluating the current game state ( TD(λ) learning)

GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);  //We apply the chosen action of the player to the environment and a new game state emerges

IF (GAME_STATE != FINAL ){ // If the new state is not final (not a winning state for the white player), do the same for the black player

    BLACK.CHOOSE_ACTION(GAME_STATE)

    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION) // We apply the chosen action of the black player to the environment and a new game state emerges.
}
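For reference, here is a minimal sketch of the kind of TD(λ) update that a PLAYER.LEARN(GAME_STATE) call would perform on the afterstate. To keep it self-contained it uses a linear value function rather than Sutton's backprop network; the feature vector, step size `alpha`, discount `gamma`, and trace decay `lam` are all illustrative assumptions, not part of the question:

    import numpy as np

    class TDLambdaLearner:
        """Sketch of afterstate TD(lambda) with a linear value function."""

        def __init__(self, n_features, alpha=0.1, gamma=1.0, lam=0.7):
            self.w = np.zeros(n_features)   # value-function weights
            self.z = np.zeros(n_features)   # eligibility trace
            self.alpha, self.gamma, self.lam = alpha, gamma, lam
            self.prev_x = None              # features of the previous afterstate

        def value(self, x):
            return float(self.w @ x)

        def learn(self, x, reward=0.0, terminal=False):
            # Update the previous afterstate's value toward the TD target
            # formed by the new afterstate x (or the final reward alone).
            if self.prev_x is not None:
                target = reward if terminal else reward + self.gamma * self.value(x)
                delta = target - self.value(self.prev_x)
                self.z = self.gamma * self.lam * self.z + self.prev_x
                self.w += self.alpha * delta * self.z
            if terminal:
                self.prev_x = None
                self.z[:] = 0.0             # reset the trace between episodes
            else:
                self.prev_x = x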

When should each player call its learning method PLAYER.LEARN(GAME_STATE)? This is the dilemma.

Option A. Right after each player's move, on the new afterstate that emerges, like this:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)    // White learns from the afterstate that emerged right after his action
IF (GAME_STATE != FINAL ){
    BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE) // Black learns from the afterstate that emerged right after his action
}

Option B. Right after each player's move, on the new afterstate that emerges, and additionally after the opponent's move, if the opponent made a winning move:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)
IF (GAME_STATE == FINAL ) //If white player won
    BLACK.LEARN(GAME_STATE) // Make the Black player learn from the White player's winning afterstate
IF (GAME_STATE != FINAL ){ //If white player's move did not produce a winning/final afterstate
    BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE)
    IF (GAME_STATE == FINAL) //If Black player won
        WHITE.LEARN(GAME_STATE) //Make the White player learn from the Black player's winning afterstate
}
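As a loop, Option B might look like the following sketch. The `world.reset()` and `world.is_final()` methods are assumptions added to make it self-contained; `choose_action`, `apply`, and `learn` mirror the pseudocode above:

    def play_episode(world, white, black):
        """One game under Option B: each player learns from its own
        afterstate, and the loser also learns from the winner's
        final afterstate."""
        game_state = world.reset()
        while True:
            for mover, opponent in ((white, black), (black, white)):
                action = mover.choose_action(game_state)
                game_state = world.apply(action)
                mover.learn(game_state)          # learn from own afterstate
                if world.is_final(game_state):
                    opponent.learn(game_state)   # learn from the winning afterstate
                    return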

I think Option B makes more sense.

1 answer:

Answer 0 (score: 0)

Typically, with TD learning, the agent will have 3 functions:

  • start(observation) → action
  • step(observation, reward) → action
  • finish(reward)

Acting is combined with learning, and some additional learning also happens when the game ends.
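In code, that agent interface could look like the sketch below. The method names follow the list above; `policy` and `update` are hypothetical placeholders for action selection and the TD(λ) weight update:

    import random

    class TDAgent:
        """Sketch of an agent exposing start/step/finish."""

        def __init__(self, actions):
            self.actions = actions
            self.prev_observation = None

        def start(self, observation):
            # First move of an episode: nothing to learn from yet, just act.
            self.prev_observation = observation
            return self.policy(observation)

        def step(self, observation, reward):
            # Act AND learn: update the previous state's value toward
            # reward + discounted value of the new state, then act.
            self.update(self.prev_observation, reward, observation)
            self.prev_observation = observation
            return self.policy(observation)

        def finish(self, reward):
            # Episode over: one final update toward the terminal reward alone.
            self.update(self.prev_observation, reward, None)
            self.prev_observation = None

        def policy(self, observation):
            # Placeholder action selection (random); a real agent would
            # evaluate afterstates with its value function.
            return random.choice(self.actions)

        def update(self, prev_obs, reward, next_obs):
            # Placeholder for the TD(lambda) update (next_obs=None => terminal).
            pass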