I have been trying to implement SARSA with linear value function approximation (LVFA). So far I have implemented the code below, but it does not seem to work (it fails to converge to the correct Q-factors even for the simplest problems).
Any help on why my code is not working would be greatly appreciated! Thanks!
I have implemented the following TD update rule (which, as I understand it, is derived from stochastic gradient descent on the TD Bellman error): https://i.ibb.co/Jyt5DFd/Capture.png
I simply replaced the TD update rule for Q with the one above, which should work since, in theory, SARSA with LVFA is guaranteed to converge to the true fixed point (within some bound). I therefore assume there is a problem with my own implementation, but I have not been able to spot any mistakes yet.
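Spelled out in the per-action-column parameterization my code uses below (one weight column theta[:, u] per action u; this is just my own transcription of the update in the linked image, so the notation may differ slightly from the source):

$$\delta = r(s, u) + \gamma\,\theta_{:,u'}^{\top}\phi(s') - \theta_{:,u}^{\top}\phi(s), \qquad \theta_{:,u} \leftarrow \theta_{:,u} + \alpha\,\delta\,\phi(s)$$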
import numpy as np

# U, S and f are globals, defined further below
def SARSA_LVFA(theta, phi, r, s, gamma = 0.5, a = 0.005):
    """ SARSA algorithm with LVFA """
    limit = 10**5
    # choose action u from eps-greedy policy
    if np.random.random() < 0.9:
        u = np.argmax(np.matmul(phi(s).T, theta))
    else:
        u = np.random.choice(U)
    for i in range(limit):
        phi_s = phi(s)  # get features for current state s
        s_ = np.random.choice(S, p = f(s, u))  # perform action u (noisy model)
        phi_s_ = phi(s_)  # get features for new state s_
        # choose action u_ from eps-greedy policy
        if np.random.random() < 0.9:
            u_ = np.argmax(np.matmul(phi_s_.T, theta))
        else:
            u_ = np.random.choice(U)
        # calculate temporal difference delta
        td_target = r(s, u) + gamma*np.matmul(theta[:, u_].T, phi_s_)
        delta = td_target - np.matmul(theta[:, u].T, phi_s)
        # update feature weights
        theta[:, u] = theta[:, u] + a * delta * phi_s.T
        s = s_
        u = u_
    return theta
Some notes on the code: U is the action space and S is the state space. theta is a weight matrix of shape len(phi) x len(U), where phi is the feature (column) vector of a state s. Computing np.matmul(Phi.T, theta) gives the Q matrix, where Phi is simply the collection of all feature vectors [phi(s1) | phi(s2) | ... | phi(sN)].
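Concretely, this is how I read off the Q matrix and the corresponding greedy policy from theta after training (a small sketch using the S, U, phi and theta described in this post; the names Phi and policy are just for illustration):

# assemble the feature matrix Phi = [phi(s1) | phi(s2) | ... | phi(sN)]
Phi = np.hstack([phi(s) for s in S])   # shape: len(phi) x len(S)
# Q matrix: one row per state, one column per action
Q = np.matmul(Phi.T, theta)            # shape: len(S) x len(U)
# greedy policy: best action in each state
policy = np.argmax(Q, axis=1)
print(Q)
print(policy)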
Let's try the above function on a toy line-following problem, with state space S = [0, 1, 2] (left of the line, on the line, and right of the line, respectively) and action space U = [0, 1, 2] (right, idle, and left, respectively). Take the following reward function r, system model f and feature function phi:
def r(x, u):
    """ reward function """
    if x == S[1]:
        return 1.0
    else:
        return 0.0
def f(x, u):
    '''
    list with probabilities for each successor is returned.
    All states are valid successors, but they can receive zero probability.
    '''
    if x == S[1]: # on line
        if u == U[2]: # left
            result = [0.2, 0.7, 0.1]
        elif u == U[0]: # right
            result = [0.1, 0.7, 0.2]
        elif u == U[1]: # none
            result = [0.0, 1.0, 0.0]
    elif x == S[0]: # left of line
        if u == U[2]:
            result = [1.0, 0.0, 0.0]
        elif u == U[0]:
            result = [0.0, 1.0, 0.0]
        elif u == U[1]:
            result = [1.0, 0.0, 0.0]
    elif x == S[2]: # right of line
        if u == U[2]:
            result = [0.0, 1.0, 0.0]
        elif u == U[0]:
            result = [0.0, 0.0, 1.0]
        elif u == U[1]:
            result = [0.0, 0.0, 1.0]
    return result
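To rule out the model itself as the culprit, I also check that every state-action pair returns a valid probability distribution (just a small sanity-check sketch, not part of the algorithm):

for x in S:
    for u in U:
        p = f(x, u)
        # one probability per successor state, summing to 1
        assert len(p) == len(S) and abs(sum(p) - 1.0) < 1e-9, (x, u, p)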
def phi1(s):
    if s == S[1]:
        return 1.0
    else:
        return 0.0

def phi2(s):
    if s != S[1]:
        return 1.0
    else:
        return 0.0

def phi(x):
    """ get features for state x """
    features = np.asarray([[phi1(x), phi2(x)]]).T
    return features
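For completeness, this is roughly how I set things up and call the function (the zero initialization of theta and the start state S[0] are just one concrete choice for illustration; theta has shape len(phi) x len(U) as described above):

S = [0, 1, 2]
U = [0, 1, 2]
theta = np.zeros((2, len(U)))   # len(phi) x len(U) weight matrix
theta_optimal = SARSA_LVFA(theta, phi, r, S[0])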
Executing theta_optimal = SARSA_LVFA(theta, phi, r, some_start_state) gives you a wrong Q matrix, for example:
[[0.27982704 0.13408623 0.28761029]
[1.71499981 1.98207434 1.72503455]
[0.27982704 0.13408623 0.28761029]]
and the corresponding wrong policy [2 1 2], or sometimes [0 1 0].
I have tried the same toy problem with plain SARSA and Q-learning (without LVFA) and obtained the correct policy [0 1 2] and Q matrix:
[[0.98987301 0.46667215 0.4698729 ]
[1.80929669 1.98819096 1.8406385 ]
[0.47045638 0.47047932 0.99035824]]