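The commented-out code block below produces the answer I want, while the uncommented block produces the wrong answer.
Can someone explain why the two blocks behave differently? The keys of self.q are supposed to be (state, action) pairs, so why does self.q[state][action] work? Shouldn't self.q accept only a single key?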
def update_q_value(self, state, action, old_q, reward, future_rewards):
    # Q-values are stored in the dictionary self.q. The keys of self.q should
    # be in the form of (state, action) pairs, where state is a tuple of all
    # pile sizes in order, and action is a tuple (i, j) representing a pile
    # and a number.
    state_pair = (tuple(state), action)
    if state_pair not in self.q:
        self.q[state_pair] = dict()
    print(old_q + self.alpha * (reward + future_rewards - old_q))
    self.q[state_pair] = old_q + self.alpha * (reward + future_rewards - old_q)

    # state = tuple(state)
    # if state not in self.q:
    #     self.q[state] = dict()
    # print(old_q + self.alpha * (reward + future_rewards - old_q))
    # self.q[state][action] = old_q + self.alpha * (reward + future_rewards - old_q)
The output of the first (uncommented) block is:
Playing training game 1
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-0.5
0.5
Playing training game 2
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-0.75
0.75
...
Playing training game 9999
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-1.0
1.0
Playing training game 10000
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-1.0
1.0
The output of the second (commented-out) block is:
Playing training game 1
0.0
0.0
0.0
0.0
0.0
0.0
-0.5
0.5
Playing training game 2
0.0
0.0
0.0
0.0
0.0
0.0
-0.25
-0.5
0.5
...
Playing training game 9999
0.0625
0.125
0.125
0.125
0.25
0.25
-0.25
-0.5
0.5
Playing training game 10000
0.0625
0.125
0.125
0.125
0.25
0.25
-0.25
-0.5
0.5
If anyone is willing to take a look, the full code is here: https://d.pr/n/MKE8iH
It can be run with the following code:
ai = train(10000)
play(ai)
Answer 0 (score: 0)
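As mentioned in the comments, self.q[state][action] works because you are creating a second dictionary as the value stored under state, and that inner dictionary uses action as its key.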
class foo():
    def __init__(self):
        self.qTuple = {}
        self.qDict = {}

    def update_q_value_tuple(self, state, action, value):
        # Flat layout: the key is the (state, action) pair itself.
        state_pair = (tuple(state), action)
        if state_pair not in self.qTuple:
            self.qTuple[state_pair] = dict()
        self.qTuple[state_pair] = value

    def update_q_value_dict(self, state, action, value):
        # Nested layout: state maps to an inner dict keyed by action.
        state = tuple(state)
        if state not in self.qDict:
            self.qDict[state] = dict()
        self.qDict[state][action] = value

f = foo()
states = ['foo', 'bar']
actions = ['hold', 'release']
for s in states:
    for a in actions:
        for v in range(0, 5):
            f.update_q_value_tuple(s, a, v)
            f.update_q_value_dict(s, a, v)
print(f.qTuple)
print(f.qDict)
Output:
{(('f', 'o', 'o'), 'hold'): 4, (('b', 'a', 'r'), 'hold'): 4, (('b', 'a', 'r'), 'release'): 4, (('f', 'o', 'o'), 'release'): 4}
{('f', 'o', 'o'): {'release': 4, 'hold': 4}, ('b', 'a', 'r'): {'release': 4, 'hold': 4}}
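To make the difference between the two layouts concrete, here is a minimal sketch of how a value would be read back from each one. The helper names get_flat and get_nested, the sample state, and the default of 0 for a missing entry are all assumptions made for illustration; none of them is taken from the linked code. The sketch only shows that a reader written for one layout finds nothing in the other, so the update and the lookup have to agree on the layout.

```python
# Hypothetical names throughout; this is an illustration, not the asker's code.
q_tuple = {}   # flat layout: one key per (state, action) pair
q_nested = {}  # nested layout: state -> {action: value}

state, action = (1, 3, 5, 7), (0, 1)

q_tuple[(state, action)] = 0.5
q_nested.setdefault(state, {})[action] = 0.5


def get_flat(q, state, action):
    """Read a value from the flat layout, defaulting to 0 (assumed default)."""
    return q.get((state, action), 0)


def get_nested(q, state, action):
    """Read a value from the nested layout, defaulting to 0 (assumed default)."""
    return q.get(state, {}).get(action, 0)


flat_hit = get_flat(q_tuple, state, action)       # finds the stored value
nested_hit = get_nested(q_nested, state, action)  # finds the stored value
# A nested-style lookup on the flat dictionary finds nothing, because the
# bare state is not a key there; only the (state, action) pair is.
mismatch = get_nested(q_tuple, state, action)
print(flat_hit, nested_hit, mismatch)
```

If the rest of a program reads Q-values in one style while update_q_value writes them in the other, every lookup would fall back to the default. Whether that is what happens in the linked code cannot be confirmed from the excerpt in the question.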
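Note: take care when creating a tuple with a single element and do not forget the trailing comma, e.g. state = (state,). Inside a call such as tuple(state, ) the trailing comma has no effect; the result is the same as tuple(state).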
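A short sketch of how the trailing comma behaves in Python, using a made-up list named piles: the comma matters in a parenthesised literal, but not inside the argument list of tuple().

```python
piles = [1, 3, 5]  # hypothetical sample data

a = (piles,)        # one-element tuple whose only item is the list
b = (piles)         # parentheses alone do nothing: this is still the list
c = tuple(piles)    # converts the list's items into a tuple
d = tuple(piles, )  # trailing comma in a call is ignored: same as c

print(len(a), type(b).__name__, c, d)
```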