Reinforcement learning - TD learning from afterstates

Date: 2015-07-05 04:54:46

Tags: machine-learning reinforcement-learning temporal-difference

I am writing a program that teaches 2 players to play a simple board game using reinforcement learning and the afterstate-based temporal difference learning method TD(λ). Learning is achieved by training a neural network; I use Sutton's NonLinear TD/Backprop neural network. I would really like your opinion on the following dilemma. The basic algorithm/pseudocode for conducting turns between the two opponents is this:

WHITE.CHOOSE_ACTION(GAME_STATE); //White player decides on its next move by evaluating the current game state ( TD(λ) learning)

GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);  //We apply the chosen action of the player to the environment and a new game state emerges

IF (GAME_STATE != FINAL ){ // If the new state is not final (not a winning state for the white player), do the same for the black player

    BLACK.CHOOSE_ACTION(GAME_STATE)

    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION) // We apply the chosen action of the black player to the environment and a new game state emerges.
}
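For reference, here is a minimal sketch of the kind of TD(λ) update that a PLAYER.LEARN(GAME_STATE) call would perform on the afterstate. To keep it self-contained it uses a linear value function rather than Sutton's backprop network; the feature vector, step size `alpha`, discount `gamma`, and trace decay `lam` are all illustrative assumptions, not part of the question:

    import numpy as np

    class TDLambdaLearner:
        """Sketch of afterstate TD(lambda) with a linear value function."""

        def __init__(self, n_features, alpha=0.1, gamma=1.0, lam=0.7):
            self.w = np.zeros(n_features)   # value-function weights
            self.z = np.zeros(n_features)   # eligibility trace
            self.alpha, self.gamma, self.lam = alpha, gamma, lam
            self.prev_x = None              # features of the previous afterstate

        def value(self, x):
            return float(self.w @ x)

        def learn(self, x, reward=0.0, terminal=False):
            # Update the previous afterstate's value toward the TD target
            # formed by the new afterstate x (or the final reward alone).
            if self.prev_x is not None:
                target = reward if terminal else reward + self.gamma * self.value(x)
                delta = target - self.value(self.prev_x)
                self.z = self.gamma * self.lam * self.z + self.prev_x
                self.w += self.alpha * delta * self.z
            if terminal:
                self.prev_x = None
                self.z[:] = 0.0             # reset the trace between episodes
            else:
                self.prev_x = x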

When should each player call its learning method PLAYER.LEARN(GAME_STATE)? This is the dilemma.

Option A. Right after each player's move, on the new afterstate that emerges, like this:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)    // White learns from the afterstate that emerged right after his action
IF (GAME_STATE != FINAL ){
    BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE) // Black learns from the afterstate that emerged right after his action
}

Option B. Right after each player's move, on the new afterstate that emerges, and additionally after the opponent's move, if the opponent made a winning move:

WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)
IF (GAME_STATE == FINAL ) //If white player won
    BLACK.LEARN(GAME_STATE) // Make the Black player learn from the White player's winning afterstate
IF (GAME_STATE != FINAL ){ //If white player's move did not produce a winning/final afterstate
    BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE)
    IF (GAME_STATE == FINAL) //If Black player won
        WHITE.LEARN(GAME_STATE) //Make the White player learn from the Black player's winning afterstate
}
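As a loop, Option B might look like the following sketch. The `world.reset()` and `world.is_final()` methods are assumptions added to make it self-contained; `choose_action`, `apply`, and `learn` mirror the pseudocode above:

    def play_episode(world, white, black):
        """One game under Option B: each player learns from its own
        afterstate, and the loser also learns from the winner's
        final afterstate."""
        game_state = world.reset()
        while True:
            for mover, opponent in ((white, black), (black, white)):
                action = mover.choose_action(game_state)
                game_state = world.apply(action)
                mover.learn(game_state)          # learn from own afterstate
                if world.is_final(game_state):
                    opponent.learn(game_state)   # learn from the winning afterstate
                    return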

I think Option B makes more sense.

1 answer:

Answer 0 (score: 0)

Typically, with TD learning, the agent will have 3 functions:

  • start(observation) → action
  • step(observation, reward) → action
  • finish(reward)

Acting is combined with learning, and some additional learning also happens when the game ends.
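In code, that agent interface could look like the sketch below. The method names follow the list above; `policy` and `update` are hypothetical placeholders for action selection and the TD(λ) weight update:

    import random

    class TDAgent:
        """Sketch of an agent exposing start/step/finish."""

        def __init__(self, actions):
            self.actions = actions
            self.prev_observation = None

        def start(self, observation):
            # First move of an episode: nothing to learn from yet, just act.
            self.prev_observation = observation
            return self.policy(observation)

        def step(self, observation, reward):
            # Act AND learn: update the previous state's value toward
            # reward + discounted value of the new state, then act.
            self.update(self.prev_observation, reward, observation)
            self.prev_observation = observation
            return self.policy(observation)

        def finish(self, reward):
            # Episode over: one final update toward the terminal reward alone.
            self.update(self.prev_observation, reward, None)
            self.prev_observation = None

        def policy(self, observation):
            # Placeholder action selection (random); a real agent would
            # evaluate afterstates with its value function.
            return random.choice(self.actions)

        def update(self, prev_obs, reward, next_obs):
            # Placeholder for the TD(lambda) update (next_obs=None => terminal).
            pass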