I have been trying to implement SARSA with linear value function approximation (LVFA). So far I have implemented the code below, but it does not seem to work (it fails to converge to the correct Q-factors even for the simplest problems).
Any help on why my code is not working would be greatly appreciated! Thanks!
I have implemented the following TD update rule (which, as I understand it, is derived from stochastic gradient descent on the TD Bellman error): https://i.ibb.co/Jyt5DFd/Capture.png
I simply replaced the TD update rule for Q with the one above, which should work since, in theory, SARSA with LVFA is guaranteed to converge to the true fixed point (within some bound). I therefore assume there is a problem with my own implementation, but I have not been able to spot any mistakes yet.
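Spelled out in the per-action-column parameterization my code uses below (one weight column theta[:, u] per action u; this is just my own transcription of the update in the linked image, so the notation may differ slightly from the source):

$$\delta = r(s, u) + \gamma\,\theta_{:,u'}^{\top}\phi(s') - \theta_{:,u}^{\top}\phi(s), \qquad \theta_{:,u} \leftarrow \theta_{:,u} + \alpha\,\delta\,\phi(s)$$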
import numpy as np

# U, S and f are globals, defined further below
def SARSA_LVFA(theta, phi, r, s, gamma = 0.5, a = 0.005):
    """ SARSA algorithm with LVFA """
    limit = 10**5
    # choose action u from eps-greedy policy
    if np.random.random() < 0.9:
        u = np.argmax(np.matmul(phi(s).T, theta))
    else:
        u = np.random.choice(U)
    for i in range(limit):
        phi_s = phi(s)  # get features for current state s
        s_ = np.random.choice(S, p = f(s, u))  # perform action u (noisy model)
        phi_s_ = phi(s_)  # get features for new state s_
        # choose action u_ from eps-greedy policy
        if np.random.random() < 0.9:
            u_ = np.argmax(np.matmul(phi_s_.T, theta))
        else:
            u_ = np.random.choice(U)
        # calculate temporal difference delta
        td_target = r(s, u) + gamma*np.matmul(theta[:, u_].T, phi_s_)
        delta = td_target - np.matmul(theta[:, u].T, phi_s)
        # update feature weights
        theta[:, u] = theta[:, u] + a * delta * phi_s.T
        s = s_
        u = u_
    return theta
Some notes on the code: U is the action space and S is the state space. theta is a weight matrix of shape len(phi) x len(U), where phi is the feature (column) vector of a state s. Computing np.matmul(Phi.T, theta) gives the Q matrix, where Phi is simply the collection of all feature vectors [phi(s1) | phi(s2) | ... | phi(sN)].
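Concretely, this is how I read off the Q matrix and the corresponding greedy policy from theta after training (a small sketch using the S, U, phi and theta described in this post; the names Phi and policy are just for illustration):

# assemble the feature matrix Phi = [phi(s1) | phi(s2) | ... | phi(sN)]
Phi = np.hstack([phi(s) for s in S])   # shape: len(phi) x len(S)
# Q matrix: one row per state, one column per action
Q = np.matmul(Phi.T, theta)            # shape: len(S) x len(U)
# greedy policy: best action in each state
policy = np.argmax(Q, axis=1)
print(Q)
print(policy)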
Let's try the above function on a toy line-following problem, with state space S = [0, 1, 2] (left of the line, on the line, and right of the line, respectively) and action space U = [0, 1, 2] (right, idle, and left, respectively). Take the following reward function r, system model f and feature function phi:
def r(x, u):
    """ reward function """
    if x == S[1]:
        return 1.0
    else:
        return 0.0
def f(x, u):
    '''
    list with probabilities for each successor is returned.
    All states are valid successors, but they can receive zero probability.
    '''
    if x == S[1]: # on line
        if u == U[2]: # left
            result = [0.2, 0.7, 0.1]
        elif u == U[0]: # right
            result = [0.1, 0.7, 0.2]
        elif u == U[1]: # none
            result = [0.0, 1.0, 0.0]
    elif x == S[0]: # left of line
        if u == U[2]:
            result = [1.0, 0.0, 0.0]
        elif u == U[0]:
            result = [0.0, 1.0, 0.0]
        elif u == U[1]:
            result = [1.0, 0.0, 0.0]
    elif x == S[2]: # right of line
        if u == U[2]:
            result = [0.0, 1.0, 0.0]
        elif u == U[0]:
            result = [0.0, 0.0, 1.0]
        elif u == U[1]:
            result = [0.0, 0.0, 1.0]
    return result
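To rule out the model itself as the culprit, I also check that every state-action pair returns a valid probability distribution (just a small sanity-check sketch, not part of the algorithm):

for x in S:
    for u in U:
        p = f(x, u)
        # one probability per successor state, summing to 1
        assert len(p) == len(S) and abs(sum(p) - 1.0) < 1e-9, (x, u, p)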
def phi1(s):
    if s == S[1]:
        return 1.0
    else:
        return 0.0

def phi2(s):
    if s != S[1]:
        return 1.0
    else:
        return 0.0

def phi(x):
    """ get features for state x """
    features = np.asarray([[phi1(x), phi2(x)]]).T
    return features
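For completeness, this is roughly how I set things up and call the function (the zero initialization of theta and the start state S[0] are just one concrete choice for illustration; theta has shape len(phi) x len(U) as described above):

S = [0, 1, 2]
U = [0, 1, 2]
theta = np.zeros((2, len(U)))   # len(phi) x len(U) weight matrix
theta_optimal = SARSA_LVFA(theta, phi, r, S[0])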
Executing theta_optimal = SARSA_LVFA(theta, phi, r, some_start_state) gives you a wrong Q matrix, for example:
[[0.27982704 0.13408623 0.28761029]
[1.71499981 1.98207434 1.72503455]
[0.27982704 0.13408623 0.28761029]]
and the corresponding wrong policy [2 1 2], or sometimes [0 1 0].
I have tried the same toy problem with plain SARSA and Q-learning (without LVFA) and obtained the correct policy [0 1 2] and Q matrix:
[[0.98987301 0.46667215 0.4698729 ]
[1.80929669 1.98819096 1.8406385 ]
[0.47045638 0.47047932 0.99035824]]