TLDR
MCTS代理实施在本地运行而没有错误,实现了 启发式驱动的极小极大赢率大于40%,但未能 自动分级机-这是项目可以进行之前的要求 已提交。自动分级机会抛出
IndexError: Cannot choose from an empty sequence
。我在寻找代码方面的建议 最有可能引发此异常。
您好,我目前停留在这个项目上,我需要在2周的时间内完成所注册的程序之前先清除该项目。我已经完成的任务是在两个国际象棋骑士之间的隔离游戏中实现一个代理,以与启发式驱动的minimax代理进行对抗。有关游戏的完整实施详细信息,请参见here。对于我的项目,将使用位板编码在9 x 11的板上玩游戏。我的MCTS实现非常简单,紧紧遵循此paper(第6页)中提供的伪代码。
本质上,通用的MCTS方法包括这4个部分,它们分别由CustomPlayer类中的以下嵌套函数实现:
反向传播-backup_negamax,update_scores
import math
import random
import time
import logging
from copy import deepcopy
from collections import namedtuple
from sample_players import DataPlayer
class CustomPlayer(DataPlayer):
""" Implement your own agent to play knight's Isolation
The get_action() method is the only required method for this project.
You can modify the interface for get_action by adding named parameters
with default values, but the function MUST remain compatible with the
default interface.
**********************************************************************
NOTES:
- The test cases will NOT be run on a machine with GPU access, nor be
suitable for using any other machine learning techniques.
- You can pass state forward to your agent on the next turn by assigning
any pickleable object to the self.context attribute.
**********************************************************************
"""
def get_action(self, state):
""" Employ an adversarial search technique to choose an action
available in the current state calls self.queue.put(ACTION) at least
This method must call self.queue.put(ACTION) at least once, and may
call it as many times as you want; the caller will be responsible
for cutting off the function after the search time limit has expired.
See RandomPlayer and GreedyPlayer in sample_players for more examples.
**********************************************************************
NOTE:
- The caller is responsible for cutting off search, so calling
get_action() from your own code will create an infinite loop!
Refer to (and use!) the Isolation.play() function to run games.
**********************************************************************
"""
logging.info("Move %s" % state.ply_count)
self.queue.put(random.choice(state.actions()))
i = 1
statlist = []
while (self.queue._TimedQueue__stop_time - 0.05) > time.perf_counter():
next_action = self.uct_search(state, statlist, i)
self.queue.put(next_action)
i += 1
def uct_search(self, state, statlist, i):
plyturn = state.ply_count % 2
Stat = namedtuple('Stat', 'state action utility visit nround')
def tree_policy(state):
statecopy = deepcopy(state)
while not statecopy.terminal_test():
# All taken actions at this depth
tried = [s.action for s in statlist if s.state == statecopy]
# See if there's any untried actions left
untried = [a for a in statecopy.actions() if a not in tried]
topop = []
toappend = []
if len(untried) > 0:
next_action = random.choice(untried)
statecopy = expand(statecopy, next_action)
break
else:
next_action = best_child(statecopy, 1)
for k, s in enumerate(statlist):
if s.state == statecopy and s.action == next_action:
visit1 = statlist[k].visit + 1
news = statlist[k]._replace(visit=visit1)
news = news._replace(nround=i)
topop.append(k)
toappend.append(news)
break
update_scores(topop, toappend)
statecopy = statecopy.result(next_action)
return statecopy
def expand(state, action):
"""
Returns a state resulting from taking an action from the list of untried nodes
"""
statlist.append(Stat(state, action, 0, 1, i))
return state.result(action)
def best_child(state, c):
"""
Returns the state resulting from taking the best action. c value between 0 (max score) and 1 (prioritize exploration)
"""
# All taken actions at this depth
tried = [s for s in statlist if s.state == state]
maxscore = -999
maxaction = []
# Compute the score
for t in tried:
score = (t.utility/t.visit) + c * math.sqrt(2 * math.log(i)/t.visit)
if score > maxscore:
maxscore = score
del maxaction[:]
maxaction.append(t.action)
elif score == maxscore:
maxaction.append(t.action)
if len(maxaction) < 1:
logging.error("IndexError: maxaction is empty!")
return random.choice(maxaction)
def default_policy(state):
"""
The simulation to run when visiting unexplored nodes. Defaults to uniform random moves
"""
while not state.terminal_test():
state = state.result(random.choice(state.actions()))
delta = state.utility(self.player_id)
if abs(delta) == float('inf') and delta < 0:
delta = -1
elif abs(delta) == float('inf') and delta > 0:
delta = 1
return delta
def backup_negamax(delta):
"""
Propagates the terminal utility up the search tree
"""
topop = []
toappend = []
for k, s in enumerate(statlist):
if s.nround == i:
if s.state.ply_count % 2 == plyturn:
utility1 = s.utility + delta
news = statlist[k]._replace(utility=utility1)
elif s.state.ply_count % 2 != plyturn:
utility1 = s.utility - delta
news = statlist[k]._replace(utility=utility1)
topop.append(k)
toappend.append(news)
update_scores(topop, toappend)
return
def update_scores(topop, toappend):
# Remove outdated tuples. Order needs to be in reverse or pop will fail!
for p in sorted(topop, reverse=True):
statlist.pop(p)
# Add the updated ones
for a in toappend:
statlist.append(a)
return
next_state = tree_policy(state)
if not next_state.terminal_test():
delta = default_policy(next_state)
backup_negamax(delta)
return best_child(state, 0)
缺少颜色格式确实使代码很难阅读。因此,请随时在我的github上进行检查。
我在本地运行游戏没有问题,我的MCTS代理人对minimax玩家的赢率超过40%(在150ms /移动限制下)。但是,当我尝试将代码提交给自动分级机时,它会被IndexError: Cannot choose from an empty sequence
异常拒绝。
根据我对课程表述的讨论,我们认为该错误很可能是由于使用random.choice()
引起的。在我的实现中有4个使用它的实例:
我认为游戏的实现是正确的,并且只要状态为终端,调用state.actions()将始终返回可能动作的列表。因此,唯一可触发此异常的实例是项目3。项目1和4只是从可用操作中随机选择,同时进行了显式检查以确保random.choice()不会被提供空列表。因此,我将日志记录应用于了第3项(即使在本地运行时也未引发任何异常),并且可以肯定的是,在50场比赛之后没有捕获任何异常。
对于冗长的帖子,我深表歉意,但我确实希望那里的人能够发现我在实现过程中错过的一些事情。
答案 0 :(得分:1)
我建议为您具有的每个功能编写单元测试。验证您对代码行为的假设。尝试全面测试,包括极端情况。通常只需输入抽象测试数据即可揭示很多有关解决方案的体系结构和详细信息。
此外,您可以尝试让您的经纪人玩任何其他合适的游戏。这些步骤将为您提供一个很好的机会,以发现代码中的错误,并使其更适合将来重用。