SARSA with linear value function approximation does not converge to the correct Q-factors

Time: 2019-04-17 13:33:58

Tags: reinforcement-learning q-learning sarsa

I have been trying to implement SARSA with LVFA (linear value function approximation). So far I have written the code below, but it does not seem to work: it fails to converge to the correct Q-factors even on the simplest problems.

Any help on figuring out why my code does not work would be greatly appreciated! Thanks!

I have implemented the following TD update rule (as I understand it, it is derived from stochastic gradient descent on the TD Bellman error): https://i.ibb.co/Jyt5DFd/Capture.png

I simply replaced the usual TD update rule for the Q-factors with the one above, which should work: in theory, SARSA with LVFA is guaranteed to converge to the true fixed point (within a known bound). So I assume the problem lies in my implementation, but I have not been able to spot any error.
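
In case the image link ever breaks: written out in the notation of the code below (theta has one weight column per action u, and alpha = a is the step size), the update the code is meant to implement is

    \delta = r(s, u) + \gamma \, \theta_{u'}^\top \phi(s') - \theta_{u}^\top \phi(s)
    \theta_{u} \leftarrow \theta_{u} + \alpha \, \delta \, \phi(s)

i.e. only the weight column of the action that was actually taken gets updated.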

import numpy as np

def SARSA_LVFA(theta, phi, r, s, gamma = 0.5, a = 0.005):
    """ SARSA algorithm with LVFA """
    limit = 10**5

    # choose action u from eps-greedy policy
    if np.random.random() < 0.9:
        u = np.argmax(np.matmul(phi(s).T, theta))
    else:
        u = np.random.choice(U)

    for i in range(limit):
        phi_s = phi(s)                            # get features for current state s
        s_ = np.random.choice(S, p = f(s, u))     # perform action u (noisy model)
        phi_s_ = phi(s_)                          # get features for new state s_

        # choose action u_ from eps-greedy policy
        if np.random.random() < 0.9:
            u_ = np.argmax(np.matmul(phi_s_.T, theta))
        else:
            u_ = np.random.choice(U)

        # calculate temporal difference delta
        td_target = r(s, u) + gamma*np.matmul(theta[:, u_].T, phi_s_)
        delta = td_target - np.matmul(theta[:, u].T, phi_s)

        # update feature weights
        theta[:, u] = theta[:, u] + a * delta * phi_s.T

        s = s_
        u = u_

    return theta

Some notes about the code:

  • U is the action space and S is the state space.
  • theta is a weight matrix of shape len(phi) x len(U), where phi is the (column) feature vector of a state s.
  • The Q matrix can be obtained simply by doing np.matmul(Phi.T, theta), where Phi is just the collection of all feature vectors [phi(s1) | phi(s2) | … | phi(sN)] (see the sketch right after this list).
  • Leave any other questions you might have in the comments!
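
To make the third bullet concrete, here is a minimal sketch of how I read off the Q matrix and the greedy policy once theta has been learned (the names Phi, Q and policy are only used here for illustration):

# Sketch: recover the Q matrix and greedy policy from the learned weights.
Phi = np.hstack([phi(s) for s in S])   # stack feature columns: [phi(s1) | phi(s2) | ... | phi(sN)]
Q = np.matmul(Phi.T, theta)            # Q[s, u] = theta[:, u]^T phi(s), shape len(S) x len(U)
policy = np.argmax(Q, axis=1)          # greedy action in each state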

Let us try the function above on a toy line-following problem, with state space S = [0, 1, 2] (left of the line, on the line and right of the line, respectively) and action space U = [0, 1, 2] (right, idle and left, respectively). Take the following reward function r, system model f and feature function phi:

def r(x, u):
    """ reward function """
    if x == S[1]:
        return 1.0
    else:
        return 0.0

def f(x, u):
    '''
    System model: returns a list with the transition probabilities over successor states.
    All states are valid successors, but they may receive zero probability.
    '''
    if x == S[1]:       # on line
        if u == U[2]:   # left
            result = [0.2, 0.7, 0.1]
        elif u == U[0]: # right
            result = [0.1, 0.7, 0.2]
        elif u == U[1]: # idle
            result = [0.0, 1.0, 0.0]
    elif x == S[0]:     # left of line
        if u == U[2]:   # left
            result = [1.0, 0.0, 0.0]
        elif u == U[0]: # right
            result = [0.0, 1.0, 0.0]
        elif u == U[1]: # idle
            result = [1.0, 0.0, 0.0]
    elif x == S[2]:     # right of line
        if u == U[2]:   # left
            result = [0.0, 1.0, 0.0]
        elif u == U[0]: # right
            result = [0.0, 0.0, 1.0]
        elif u == U[1]: # idle
            result = [0.0, 0.0, 1.0]
    return result

def phi1(s):
    """ feature 1: indicator for being on the line """
    if s == S[1]:
        return 1.0
    else:
        return 0.0

def phi2(s):
    """ feature 2: indicator for being off the line """
    if s != S[1]:
        return 1.0
    else:
        return 0.0

def phi(x):
    """ get the (column) feature vector for state x """
    features = np.asarray([[phi1(x), phi2(x)]]).T
    return features
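
For completeness, this is roughly how I set up and run the experiment (the zero initialisation of theta and the choice of start state are arbitrary on my part), reading the Q matrix off as in the sketch above:

# Sketch of the setup (initial theta and start state chosen arbitrarily).
S = [0, 1, 2]                              # left of line, on line, right of line
U = [0, 1, 2]                              # right, idle, left

theta = np.zeros((2, len(U)))              # len(phi) x len(U) weight matrix
theta_optimal = SARSA_LVFA(theta, phi, r, S[0])

Phi = np.hstack([phi(s) for s in S])
Q = np.matmul(Phi.T, theta_optimal)        # learned Q matrix
policy = np.argmax(Q, axis=1)              # corresponding greedy policy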

Executing theta_optimal = SARSA_LVFA(theta, phi, r, some_start_state) gives a wrong Q matrix, for example:

[[0.27982704 0.13408623 0.28761029]
 [1.71499981 1.98207434 1.72503455]
 [0.27982704 0.13408623 0.28761029]]

together with the corresponding wrong policy [2 1 2], or sometimes [0 1 0].

I have tried the same toy problem with plain SARSA and Q-learning (without LVFA) and obtained the correct policy [0 1 2] and Q matrix:

[[0.98987301 0.46667215 0.4698729 ]
 [1.80929669 1.98819096 1.8406385 ]
 [0.47045638 0.47047932 0.99035824]]
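
For reference, the tabular SARSA I compared against follows the same structure, with the standard table update in place of the weight update. A rough sketch (not my exact code), called as SARSA_tabular(np.zeros((len(S), len(U))), r, some_start_state):

def SARSA_tabular(Q, r, s, gamma = 0.5, a = 0.005):
    """ Rough sketch of the tabular SARSA used for the comparison (same eps-greedy scheme). """
    limit = 10**5
    u = np.argmax(Q[s]) if np.random.random() < 0.9 else np.random.choice(U)
    for i in range(limit):
        s_ = np.random.choice(S, p = f(s, u))      # perform action u (noisy model)
        u_ = np.argmax(Q[s_]) if np.random.random() < 0.9 else np.random.choice(U)
        # standard tabular TD update
        Q[s, u] = Q[s, u] + a * (r(s, u) + gamma * Q[s_, u_] - Q[s, u])
        s, u = s_, u_
    return Q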

0 Answers:

No answers yet.