前言
我试图解决this Windy-Grid-World环境。实现了Q和Q(λ)算法,结果几乎相同(我看每集的步骤)。
问题:
从我读过的内容来看,我认为更高的lambda参数应该更多地更新状态,然后再导致它;因此,与常规Q学习相比,步骤的数量应该大大减少。这个image显示了我在说什么。
这个环境是正常的还是我错误地实现了?
代码:
import matplotlib.pyplot as plt
import numpy as np
from lib.envs.windy_gridworld import WindyGridworldEnv
from collections import defaultdict
env = WindyGridworldEnv()
def epsilon_greedy_policy(Q, state, nA, epsilon):
'''
Create a policy in which epsilon dictates how likely it will
take a random action.
:param Q: links state -> action value (dictionary)
:param state: state character is in (int)
:param nA: number of actions (int)
:param epsilon: chance it will take a random move (float)
:return: probability of each action to be taken (list)
'''
probs = np.ones(nA) * epsilon / nA
best_action = np.argmax(Q[state])
probs[best_action] += 1.0 - epsilon
return probs
def Q_learning_lambda(episodes, learning_rate, discount, epsilon, _lambda):
'''
Learns to solve the environment using Q(λ)
:param episodes: Number of episodes to run (int)
:param learning_rate: How fast it will converge to a point (float [0, 1])
:param discount: How much future events lose their value (float [0, 1])
:param epsilon: chance a random move is selected (float [0, 1])
:param _lambda: How much credit to give states leading up to reward (float [0, 1])
:return: x,y points to graph
'''
# Link state to action values
Q = defaultdict(lambda: np.zeros(env.action_space.n))
# Eligibility trace
e = defaultdict(lambda: np.zeros(env.action_space.n))
# Points to plot
# number of episodes
x = np.arange(episodes)
# number of steps
y = np.zeros(episodes)
for episode in range(episodes):
state = env.reset()
# Select action
probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
action = np.random.choice(len(probs), p=probs)
for step in range(10000):
# Take action
next_state, reward, done, _ = env.step(action)
# Select next action
probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
next_action = np.random.choice(len(probs), p=probs)
# Get update value
best_next_action = np.argmax(Q[next_state])
td_target = reward + discount * Q[next_state][best_next_action]
td_error = td_target - Q[state][action]
e[state][action] += 1
# Update all states
for s in Q:
for a in range(len(Q[s])):
# Update Q value based on eligibility trace
Q[s][a] += learning_rate * td_error * e[s][a]
# Decay eligibility trace if best action is taken
if next_action is best_next_action:
e[s][a] = discount * _lambda * e[s][a]
# Reset eligibility trace if random action taken
else:
e[s][a] = 0
if done:
y[episode] = step
e.clear()
break
# Update action and state
action = next_action
state = next_state
return x, y
如果您想查看整件事,可以查看我的Jupyter Notebook here。
答案 0 :(得分:0)
您的实现没有问题。
您为Q(λ)实现的是Watkins版本的Q(λ)。在他的版本中,对于非贪婪的行为,资格跟踪将为零,而对于贪婪的行为,则仅进行备份。如eligibility traces(p25)中所述,沃特金斯(Watkins)Q(λ)的缺点在于,在早期学习中,合格跟踪将被频繁地“削减”(归零),从而对跟踪没有太大帮助。也许这就是为什么您的Q学习和Q(λ)学习具有非常相似的性能的原因。
您可以尝试其他资格跟踪,例如Peng的资格或天真的资格,以检查表演是否有任何提升。