I am trying to implement Q-learning for tic-tac-toe. One step in doing so involves enumerating all possible states of the tic-tac-toe board to form a state-value table. I wrote a program that generates all possible states recursively, starting from the empty board; to do this I implicitly perform a preorder traversal of the search-space tree. However, at the end of it all I only get 707 unique states, whereas the general consensus is that the number of legal states is around 5000.
NOTE: I am referring to the number of legal states. I am aware that the number of states is closer to 19,000 if either player is allowed to keep playing after the game has already ended (which is what I mean by an illegal state).
CODE:
def generate_state_value_table(self, state, turn):
    winner = int(is_game_over(state))  #check if, for the current turn and state, the game has finished and if so who won
    #print "\nWinner is ", winner
    #print "\nBoard at turn: ", turn
    #print_board(state)
    self.add_state(state, winner/2 + 0.5)  #add the current state with the appropriate value to the state table
    open_cells = open_spots(state)  #find the indices (from 0 to total no. of cells) of all the empty cells in the board
    #check if there are any empty cells in the board
    if len(open_cells) > 0:
        for cell in open_cells:
            #pdb.set_trace()
            row, col = cell / len(state), cell % len(state)
            new_state = deepcopy(state)  #make a copy of the current state
            #check which player's turn it is
            if turn % 2 == 0:
                new_state[row][col] = 1
            else:
                new_state[row][col] = -1
            #using a try block because recursive depth may be exceeded
            try:
                #check if the new state has not been generated somewhere else in the search tree
                if not self.check_duplicates(new_state):
                    self.generate_state_value_table(new_state, turn+1)
                else:
                    return
            except:
                #print "Recursive depth exceeded"
                exit()
    else:
        return
You can check out the full code here if needed.
EDIT: I have cleaned up the code at the link a little and added more comments to make things clearer. Hope that helps.
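For reference, the helpers used above are not shown here. Below is a rough, hypothetical sketch of what open_spots and is_game_over might look like, purely so the recursion can be read on its own; the real implementations are in the linked code, and check_duplicates / add_state simply look up and insert states in the class's state table.

def open_spots(state):
    #return the flat indices (0 .. n*n-1, row-major) of all empty cells in the board
    n = len(state)
    return [r * n + c for r in range(n) for c in range(n) if state[r][c] == 0]

def is_game_over(state):
    #return 1 or -1 if that player has completed a line, 0 otherwise
    n = len(state)
    lines = [list(state[r]) for r in range(n)]                       #rows
    lines += [[state[r][c] for r in range(n)] for c in range(n)]     #columns
    lines.append([state[i][i] for i in range(n)])                    #main diagonal
    lines.append([state[i][n - 1 - i] for i in range(n)])            #anti-diagonal
    for line in lines:
        if line[0] != 0 and all(v == line[0] for v in line):
            return line[0]
    return 0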
ANSWER 0 (score 0):
So I finally solved the problem myself, and I'm posting this answer for anyone facing a similar issue. The bug was in the way I handled duplicate states. If a newly generated state has already been generated somewhere else in the search tree, it should not be added to the state table; however, the mistake I made was to cut the preorder traversal short whenever a duplicate was found, instead of simply skipping that state and moving on to the next one.
Simply put: removing the else clause from the code below gave me a state count of 6046:
#check if the new state has not been generated somewhere else in the search tree
if not self.check_duplicates(new_state):
    self.generate_state_value_table(new_state, turn+1)
else:
    return
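In other words, after the fix the body of the inner for loop simply falls through to the next open cell when a duplicate is found. A sketch of just that part:

#check if the new state has not been generated somewhere else in the search tree
if not self.check_duplicates(new_state):
    self.generate_state_value_table(new_state, turn+1)
#no else branch: a duplicate only means this child was explored already,
#so control falls through and the loop continues with the next open cell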
Additionally, I stopped exploring a branch of the search tree as soon as I reached a state with a definite winner. Specifically, I added the following code right after self.add_state(state, winner/2 + 0.5):
#check if the winner returned is one of the players and go back to the previous state if so
if winner != 0:
    return
This gave me a state count of 5762, which is exactly what I was looking for.
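For anyone who wants an independent sanity check, here is a small self-contained sketch (not part of the original code above) that enumerates every board reachable from the empty board, treating a position as terminal as soon as one player completes a line. The commonly cited total for this kind of enumeration is 5478; counts in the 5000-6000 range, like the 5762 above, typically differ only in exactly when terminal and duplicate states are recorded.

#A self-contained sanity check: enumerate every board reachable from the
#empty board, stopping a branch as soon as one player completes a line.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   #rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   #columns
         (0, 4, 8), (2, 4, 6)]              #diagonals

def winner_of(board):
    #return 1 or -1 if that player has completed a line, 0 otherwise
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def enumerate_states(board=(0,) * 9, player=1, seen=None):
    #collect every reachable board into the set 'seen' and return it
    if seen is None:
        seen = set()
    if board in seen:          #duplicate: skip it, but siblings are still explored
        return seen
    seen.add(board)
    if winner_of(board) != 0:  #terminal state: do not expand further
        return seen
    for cell in range(9):
        if board[cell] == 0:
            child = board[:cell] + (player,) + board[cell + 1:]
            enumerate_states(child, -player, seen)
    return seen

print(len(enumerate_states()))  #with these conventions this prints 5478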