I am making a program that teaches 2 players to play a simple board game using reinforcement learning and an afterstate-based temporal-difference learning method (TD(λ)). The learning is done by training a neural network; I use Sutton's NonLinear TD/Backprop neural network. I would really like your opinion on the following dilemma. The basic algorithm/pseudocode for alternating turns between the two opponents is:
WHITE.CHOOSE_ACTION(GAME_STATE); //White player decides on its next move by evaluating the current game state ( TD(λ) learning)
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION); //We apply the chosen action of the player to the environment and a new game state emerges
IF (GAME_STATE != FINAL) { // If the new state is not final (not a winning state for the White player), do the same for the Black player
BLACK.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION); // We apply the chosen action of the Black player to the environment and a new game state emerges.
}
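Written out as a minimal Python sketch, the same turn-alternation loop could look like the function below. The World and Player objects (with apply(), choose_action() and is_final() methods) are hypothetical stand-ins for whatever environment and agent classes the actual program uses, not part of any library.

def play_turn_pair(world, white, black, game_state):
    # White decides on its next move by evaluating the current game state.
    action = white.choose_action(game_state)
    # Applying the action to the environment produces a new game state.
    game_state = world.apply(action)

    # Black only moves if White's move did not produce a final state.
    if not world.is_final(game_state):
        action = black.choose_action(game_state)
        game_state = world.apply(action)

    return game_state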
When should each player call its learning method PLAYER.LEARN(GAME_STATE)? That is the dilemma.
Option A: immediately after each player's own move, as soon as the new afterstate has emerged, like this:
WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE) // White learns from the afterstate that emerged right after his action
IF (GAME_STATE != FINAL) {
BLACK.CHOOSE_ACTION(GAME_STATE)
GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
BLACK.LEARN(GAME_STATE) // Black learns from the afterstate that emerged right after his action
}
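As a sketch in the same style as before (same hypothetical World/Player interfaces, passed in as parameters), Option A amounts to this:

def play_turn_pair_option_a(world, white, black, game_state):
    action = white.choose_action(game_state)
    game_state = world.apply(action)
    white.learn(game_state)   # White learns only from its own afterstate

    if not world.is_final(game_state):
        action = black.choose_action(game_state)
        game_state = world.apply(action)
        black.learn(game_state)   # Black learns only from its own afterstate

    return game_state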
Option B: immediately after each player's own move, as soon as the new afterstate has emerged, and additionally after the opponent's move, if the opponent made a winning move.
WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)
IF (GAME_STATE == FINAL ) //If white player won
BLACK.LEARN(GAME_STATE) // Make the Black player learn from the White player's winning afterstate
IF (GAME_STATE != FINAL) { //If white player's move did not produce a winning/final afterstate
BLACK.CHOOSE_ACTION(GAME_STATE)
GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
BLACK.LEARN(GAME_STATE)
IF (GAME_STATE == FINAL) //If Black player won
WHITE.LEARN(GAME_STATE) //Make the White player learn from the Black player's winning afterstate
}
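Option B in the same sketch form differs only in the two extra learn() calls on the terminal afterstate (again using the hypothetical interfaces from above):

def play_turn_pair_option_b(world, white, black, game_state):
    action = white.choose_action(game_state)
    game_state = world.apply(action)
    white.learn(game_state)

    if world.is_final(game_state):
        black.learn(game_state)   # Black also learns from White's winning afterstate
        return game_state

    action = black.choose_action(game_state)
    game_state = world.apply(action)
    black.learn(game_state)

    if world.is_final(game_state):
        white.learn(game_state)   # White also learns from Black's winning afterstate

    return game_state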
I think Option B is more reasonable: without it, the losing player never gets to update on the terminal afterstate, so it never sees the final outcome of the game as a learning signal.
Answer (score: 0):
Usually, with TD learning, the agent will have 3 functions. Acting is combined with learning, and more learning is also done at the end of the game.
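Read that way (acting and learning interleaved on every move, plus an extra terminal update), a whole self-play episode could be driven by a loop like the sketch below. This is only one interpretation of the advice, again using the hypothetical World/Player interfaces from the earlier sketches, and it ends up behaving like Option B.

def play_episode(world, white, black, initial_state):
    game_state = initial_state
    players = [white, black]
    turn = 0

    while not world.is_final(game_state):
        mover = players[turn % 2]
        action = mover.choose_action(game_state)
        game_state = world.apply(action)
        mover.learn(game_state)   # acting and learning are combined in one step
        turn += 1

    # Extra learning at the end of the game: the player who did NOT make the
    # final move also updates on the terminal afterstate, so both sides see
    # the final outcome (same effect as Option B).
    players[turn % 2].learn(game_state)

    return game_state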