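I am trying to implement the Sarsa algorithm to solve the Frozen Lake environment from OpenAI Gym. I only started working with it recently, but I think I understand it.
I also understand how the Sarsa algorithm works; the pseudocode is available on many sites, and I follow it. I implemented the algorithm for my problem following all of its steps, but when I check the final Q function after all the episodes, I notice that all values tend to zero, and I don't know why.
Here is my code. I hope someone can tell me why this happens.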
import gym
import random
import numpy as np

env = gym.make('FrozenLake-v0')

# Initialize the Q matrix 16(rows)x4(columns)
Q = np.zeros([env.observation_space.n, env.action_space.n])
for i in range(env.observation_space.n):
    # Leave the hole and goal states (5, 7, 11, 12, 15) at zero
    if (i != 5) and (i != 7) and (i != 11) and (i != 12) and (i != 15):
        for j in range(env.action_space.n):
            Q[i,j] = np.random.rand()

# Epsilon-Greedy policy: given a state, the agent chooses the action that it
# believes has the best long-term effect with probability 1-eps; otherwise it
# chooses an action uniformly at random. Epsilon may change its value.
bestreward = 0
epsilon = 0.1
discount = 0.99
learning_rate = 0.1
num_episodes = 50000
a = [0,0,0,0,0,0,0,0,0,0]

for i_episode in range(num_episodes):
    # Observe current state s
    observation = env.reset()
    currentState = observation

    # Select action a using a policy based on Q
    if np.random.rand() <= epsilon: # pick randomly
        currentAction = random.randint(0,env.action_space.n-1)
    else: # pick greedily
        currentAction = np.argmax(Q[currentState, :])

    totalreward = 0
    while True:
        env.render()

        # Carry out an action a
        observation, reward, done, info = env.step(currentAction)
        if done:
            break

        # Observe reward r and state s'
        totalreward += reward
        nextState = observation

        # Select action a' using a policy based on Q
        if np.random.rand() <= epsilon: # pick randomly
            nextAction = random.randint(0,env.action_space.n-1)
        else: # pick greedily
            nextAction = np.argmax(Q[nextState, :])

        # Update Q with the Sarsa rule
        Q[currentState, currentAction] += learning_rate * (reward + discount * Q[nextState, nextAction] - Q[currentState, currentAction])

        currentState = nextState
        currentAction = nextAction

    print("Episode: %d reward %d best %d epsilon %f" % (i_episode, totalreward, bestreward, epsilon))
    if totalreward > bestreward:
        bestreward = totalreward
    if i_episode > num_episodes/2:
        epsilon = epsilon * 0.9999
    if i_episode >= num_episodes-10:
        # Keep the rewards of the last 10 episodes
        a.insert(0, totalreward)
        a.pop()

print(a)

# Print the final Q function, one state at a time
for i in range(env.observation_space.n):
    print("-----")
    for j in range(env.action_space.n):
        print(Q[i,j])
Answer 0 (score: 1)
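When the episode ends, you break out of the while loop before updating the Q function. As a result, the Q function is never updated on the one step where the agent receives a non-zero reward (the step on which it reaches the goal state).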
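To see why the values drift toward zero instead of simply staying where they started: with the reward term always 0, each update pulls Q(s, a) toward discount * Q(s', a'), which shrinks the values whenever discount < 1. A minimal sketch, assuming a hypothetical single state that always transitions to itself (this is not FrozenLake, and `zero_reward_updates` is an illustrative name), using the learning_rate and discount from the question:

```python
# Hypothetical one-state, one-action loop: s' == s and a' == a on every step.
learning_rate = 0.1
discount = 0.99

def zero_reward_updates(q, steps):
    """Apply the Sarsa update `steps` times with the reward fixed at 0."""
    for _ in range(steps):
        reward = 0.0  # the only reward value the question's loop ever uses
        q += learning_rate * (reward + discount * q - q)
    return q

q_start = 0.8  # arbitrary initial value, standing in for np.random.rand()
q_end = zero_reward_updates(q_start, 10000)
print(q_end)  # a small positive number, far below q_start
```

Each update multiplies the value by 1 - learning_rate * (1 - discount) = 0.999, so it decays geometrically over the episodes.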
You should check for the end of the episode in the last part of the while loop, after the update has been applied.
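The reordered loop could look like the sketch below. It is only an illustration under some assumptions: the environment follows the same interface as in the question (`reset()` returns the state, `step()` returns a 4-tuple), the value of a terminal state is treated as zero, and Q is a plain list of lists so the block runs without NumPy. The names `choose_action`, `run_sarsa_episode` and `TinyChain` are hypothetical; `TinyChain` is a stand-in environment used only so the example runs without Gym.

```python
import random

def choose_action(Q, state, epsilon):
    """Epsilon-greedy choice over the row Q[state]."""
    if random.random() <= epsilon:  # pick randomly
        return random.randrange(len(Q[state]))
    row = Q[state]  # pick greedily (first maximum wins ties)
    return max(range(len(row)), key=row.__getitem__)

def run_sarsa_episode(env, Q, epsilon, discount, learning_rate):
    """Run one episode, updating Q on every step including the last one."""
    state = env.reset()
    action = choose_action(Q, state, epsilon)
    total_reward = 0.0
    while True:
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        next_action = choose_action(Q, next_state, epsilon)
        # On the final step there is no future value to bootstrap from.
        future = 0.0 if done else Q[next_state][next_action]
        Q[state][action] += learning_rate * (
            reward + discount * future - Q[state][action])
        if done:  # end-of-episode check comes after the update
            break
        state, action = next_state, next_action
    return total_reward

class TinyChain(object):
    """Hypothetical stand-in environment: states 0 -> 1 -> 2, goal at 2.

    Action 1 moves right, action 0 stays put; reaching state 2 pays 1.
    """
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, (1.0 if done else 0.0), done, {}

random.seed(0)
env = TinyChain()
Q = [[0.0, 0.0] for _ in range(3)]
for _ in range(500):
    run_sarsa_episode(env, Q, epsilon=0.1, discount=0.99, learning_rate=0.1)
print(Q)
```

Because the update now runs on the step that reaches the goal, the reward of 1 enters Q and propagates back to earlier states on later episodes. With the question's code, the same change amounts to moving the `done` check below the Q update.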