I am trying to implement Monte Carlo tree search (MCTS) in Python to play tic-tac-toe. My current implementation is as follows:
I have a Board class that handles moves on the tic-tac-toe board. The state of the board is represented by a 2x3x3 numpy array, where each of the two 3x3 matrices is a binary matrix marking the presence of X's and the presence of O's respectively.
import numpy as np

class Board:
    '''
    class handling state of the board
    '''
    def __init__(self):
        self.state = np.zeros([2,3,3])
        self.player = 0 # current player's turn

    def copy(self):
        '''
        make copy of the board
        '''
        copy = Board()
        copy.player = self.player
        copy.state = np.copy(self.state)
        return copy

    def move(self, move):
        '''
        take move of form [x,y] and play
        the move for the current player
        '''
        if np.any(self.state[:,move[0],move[1]]): return
        self.state[self.player][move[0],move[1]] = 1
        self.player ^= 1

    def get_moves(self):
        '''
        return remaining possible board moves
        (ie where there are no O's or X's)
        '''
        return np.argwhere(self.state[0]+self.state[1]==0).tolist()

    def result(self):
        '''
        check rows, columns, and diagonals
        for sequence of 3 X's or 3 O's
        '''
        board = self.state[self.player^1]
        col_sum = np.any(np.sum(board,axis=0)==3)
        row_sum = np.any(np.sum(board,axis=1)==3)
        d1_sum = np.any(np.trace(board)==3)
        d2_sum = np.any(np.trace(np.flip(board,1))==3)
        return col_sum or row_sum or d1_sum or d2_sum
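For illustration, here is a quick sanity check of the state representation (just a sketch using the class above, not part of the search code):

b = Board()
b.move([0, 0])        # X takes the top-left corner -> state[0][0,0] == 1
b.move([1, 1])        # O takes the centre          -> state[1][1,1] == 1
print(b.state[0])     # X plane
print(b.state[1])     # O plane
print(b.get_moves())  # the 7 remaining squares as [row, col] pairs
print(b.result())     # False, nobody has three in a row yet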
I then have a Node class that maintains the attributes of each node while the search tree is being built:
from math import sqrt, log

class Node:
    '''
    maintains state of nodes in
    the monte carlo search tree
    '''
    def __init__(self, parent=None, action=None, board=None):
        self.parent = parent
        self.board = board
        self.children = []
        self.wins = 0
        self.visits = 0
        self.untried_actions = board.get_moves()
        self.action = action

    def select(self):
        '''
        select child of node with
        highest UCB1 value
        '''
        s = sorted(self.children, key=lambda c: c.wins/c.visits + 0.2*sqrt(2*log(self.visits)/c.visits))
        return s[-1]

    def expand(self, action, board):
        '''
        expand parent node (self) by adding child
        node with given action and state
        '''
        child = Node(parent=self, action=action, board=board)
        self.untried_actions.remove(action)
        self.children.append(child)
        return child

    def update(self, result):
        self.visits += 1
        self.wins += result
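To show how these methods fit together, here is a minimal sketch that builds a one-child tree by hand (it assumes the Board class above and is not part of the search itself):

b = Board()
root = Node(board=b.copy())              # root starts with 9 untried actions
next_board = b.copy()
next_board.move([0, 0])
child = root.expand([0, 0], next_board)  # removes [0, 0] from root.untried_actions
child.update(1)                          # pretend the rollout from this child was a win
root.update(1)
print(len(root.untried_actions))         # 8
print(root.select().action)              # [0, 0], the only child so far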
Finally, I have a UCT function that pulls everything together. This function takes a Board object and builds the Monte Carlo search tree to determine the next best move from the given board state:
import random

def UCT(rootstate, maxiters):
    root = Node(board=rootstate)
    for i in range(maxiters):
        node = root
        board = rootstate.copy()

        # selection - select best child if parent fully expanded and not terminal
        while node.untried_actions == [] and node.children != []:
            node = node.select()
            board.move(node.action)

        # expansion - expand parent to a random untried action
        if node.untried_actions != []:
            a = random.choice(node.untried_actions)
            board.move(a)
            node = node.expand(a, board.copy())

        # simulation - rollout to terminal state from current
        # state using random actions
        while board.get_moves() != [] and not board.result():
            board.move(random.choice(board.get_moves()))

        # backpropagation - propagate result of rollout game up the tree
        # reverse the result if player at the node lost the rollout game
        while node != None:
            result = board.result()
            if result:
                if node.board.player == board.player:
                    result = 1
                else:
                    result = -1
            else:
                result = 0
            node.update(result)
            node = node.parent

    s = sorted(root.children, key=lambda c: c.wins/c.visits)
    return s[-1].action
I have been going over this code for hours and simply cannot find the error in my implementation. I have tested many board states and pitted two agents against each other, but the function returns bad actions even for the simplest board states. What am I missing and/or what is wrong with my implementation?
EDIT: Here is an example of how the two agents are implemented:
b = Board() # instantiate board
# while there are moves left to play and neither player has won
while b.get_moves() != [] and not b.result():
    a = UCT(b, 1000) # get next best move
    b.move(a)        # make move
    print(b.state)   # show state
Answer (score: 3)
The problem appears to be the following:

- Your get_moves() function does not check whether the game has already ended, so it can generate a non-empty list of moves for states in which someone has already won.
- When creating a Node you also do not check whether the game state is already over, so a non-empty untried_actions list is created for such states.
- result() can return the wrong winner. It only checks whether the player who made the most recent move won, which is correct if you stop the game as soon as someone wins, but can be wrong if you keep playing after someone has already won.

As a result, you propagate all kinds of incorrect results through the tree. The easiest way to fix this is to modify get_moves() so that it returns an empty list once the game is over. Those nodes will then always fail the if node.untried_actions != [] check, which means the expansion phase is skipped entirely and you move straight on to the play-out phase, where terminal game states are checked properly. This can be done as follows:
def get_moves(self):
    """
    return remaining possible board moves
    (ie where there are no O's or X's)
    """
    if self.result():
        return []
    return np.argwhere(self.state[0] + self.state[1] == 0).tolist()
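With this change a finished game produces no moves, so terminal nodes are never expanded. A quick check (a sketch, assuming the Board class above with the patched get_moves()):

b = Board()
for m in [[0, 0], [1, 0], [0, 1], [1, 1], [0, 2]]:  # X completes the top row
    b.move(m)
print(b.result())     # True, X has won
print(b.get_moves())  # [] with the fix, so a node for this state has no untried_actions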