Question

In the code below, the commented-out block produces the answer I want, while the uncommented block produces the wrong answer.

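Can someone explain why the two blocks behave differently? The keys of self.q are supposed to be (state, action) pairs, so why does self.q[state][action] work at all? Shouldn't self.q accept only a single key?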

    def update_q_value(self, state, action, old_q, reward, future_rewards):
        # Q-values are stored in the dictionary self.q. The keys of self.q should be in the form of (state, action) pairs, where state is a tuple of all piles sizes in order, and action is a tuple (i, j) representing a pile and a number.

        state_pair = (tuple(state), action)
        if state_pair not in self.q:
            self.q[state_pair] = dict()

        print(old_q + self.alpha * (reward + future_rewards - old_q))

        self.q[state_pair] = old_q + self.alpha * (reward + future_rewards - old_q)

        # state = tuple(state)
        # if state not in self.q:
        #     self.q[state] = dict()

        # print(old_q + self.alpha * (reward + future_rewards - old_q))

        # self.q[state][action] = old_q + self.alpha * (reward + future_rewards - old_q)

The output of the first block is:

Playing training game 1
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-0.5
0.5
Playing training game 2
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-0.75
0.75
...
Playing training game 9999
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-1.0
1.0
Playing training game 10000
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-1.0
1.0

The output of the second block is:

Playing training game 1
0.0
0.0
0.0
0.0
0.0
0.0
-0.5
0.5
Playing training game 2
0.0
0.0
0.0
0.0
0.0
0.0
-0.25
-0.5
0.5
...
Playing training game 9999
0.0625
0.125
0.125
0.125
0.25
0.25
-0.25
-0.5
0.5
Playing training game 10000
0.0625
0.125
0.125
0.125
0.25
0.25
-0.25
-0.5
0.5

If anyone would like to take a look, the full code is at https://d.pr/n/MKE8iH and can be run with:

ai = train(10000)
play(ai)

Answer 1

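As mentioned in the comments, self.q[state][action] works because you are creating a second dictionary as the value stored under state, and that inner dictionary uses action as its key.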

class foo():
    def __init__(self):
        self.qTuple = {}
        self.qDict = {}

    def update_q_value_tuple(self, state, action, value):
        state_pair = (tuple(state), action)
        if state_pair not in self.qTuple:
            self.qTuple[state_pair] = dict()
        self.qTuple[state_pair] = value


    def update_q_value_dict(self, state, action, value):
        state = tuple(state)
        if state not in self.qDict:
            self.qDict[state] = dict()
        self.qDict[state][action] = value


f = foo()
states = ['foo', 'bar']
actions = ['hold', 'release']

for s in states:
    for a in actions:
        for v in range(0, 5):
            f.update_q_value_tuple(s, a, v)
            f.update_q_value_dict(s, a, v)

print(f.qTuple)
print(f.qDict)

Output:

{(('f', 'o', 'o'), 'hold'): 4, (('b', 'a', 'r'), 'hold'): 4, (('b', 'a', 'r'), 'release'): 4, (('f', 'o', 'o'), 'release'): 4}
{('f', 'o', 'o'): {'release': 4, 'hold': 4}, ('b', 'a', 'r'): {'release': 4, 'hold': 4}}
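
To make the difference concrete, here is a minimal sketch of how the two layouts are read back. The names q_flat and q_nested and the sample state and action are illustrative assumptions, not taken from the linked code. It suggests that a value written in one layout is not found by a lookup that expects the other, so the code that writes Q-values and the code that reads them need to agree on one layout.

```python
# Illustrative names only; not taken from the asker's full code.
state = (1, 3, 5, 7)   # pile sizes
action = (0, 1)        # (pile index, number of objects)

# Layout 1: one flat dictionary keyed by a (state, action) tuple.
q_flat = {}
q_flat[(state, action)] = 0.5

# Layout 2: a dictionary of dictionaries, state -> {action: value}.
q_nested = {}
q_nested.setdefault(state, {})[action] = 0.5

# q_nested[state][action] is simply two lookups chained together.
inner = q_nested[state]
assert inner[action] == q_nested[state][action] == 0.5

# A reader written for one layout misses values stored in the other.
flat_lookup_on_nested = q_nested.get((state, action), 0)
nested_lookup_on_flat = q_flat.get(state, {}).get(action, 0)

print(q_flat[(state, action)])   # 0.5
print(flat_lookup_on_nested)     # 0
print(nested_lookup_on_flat)     # 0
```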

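Note: be careful when creating a tuple with a single element, and do not forget the trailing comma: state = (state,). By contrast, tuple(state), with or without a trailing comma inside the call, iterates over state, which is why the string 'foo' shows up as ('f', 'o', 'o') in the output above.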
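
A small sketch of how the trailing comma behaves; the variable names are illustrative. Inside a call such as tuple(state, ) the comma has no effect, whereas in a literal such as (state,) the comma is what makes the tuple.

```python
state = 'foo'

as_iterated = tuple(state)       # iterates over the string's characters
as_iterated_2 = tuple(state, )   # a trailing comma inside a call changes nothing
one_element = (state,)           # the trailing comma makes a one-element tuple
not_a_tuple = (state)            # just parentheses around a string

print(as_iterated)   # ('f', 'o', 'o')
print(one_element)   # ('foo',)
print(not_a_tuple)   # foo
```

If state is already a list of pile sizes, tuple(state) is the right call, since it converts the list into a hashable tuple that can serve as a dictionary key.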

Using a tuple as a dictionary key
