I wrote a script for a simple policy gradient method based on the pseudocode in David Silver's UCL RL lecture notes.
I am using a Gaussian policy, because I am trying to solve OpenAI Gym's continuous mountain car problem.
If I use a high gamma, e.g. 0.9, the system diverges. If I use a higher learning rate, the system also diverges. Is the variance too high? Is my Gaussian policy implementation wrong?
The derivative (score function) I am using is

∇θ log πθ(s, a) = (a − μ(s)) φ(s) / σ²

where a ∼ N(μ(s), σ²) and the mean is linear in the state features, μ(s) = θᵀ φ(s).
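As a quick sanity check on that formula, the analytic score can be compared against a finite-difference gradient of log N(a; θᵀφ(s), σ²). This is only an illustrative sketch; theta, phi and sigma here are stand-in names, not the variables from my script below:

import numpy as np

def log_prob(theta, phi, a, sigma):
    # log density of a ~ N(theta.phi, sigma^2)
    mu = float(np.dot(theta, phi))
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def score(theta, phi, a, sigma):
    # analytic grad_theta log pi_theta(s, a) = (a - mu(s)) * phi(s) / sigma^2
    mu = float(np.dot(theta, phi))
    return (a - mu) * phi / sigma ** 2

rng = np.random.default_rng(0)
theta = rng.normal(size=2)
phi = rng.normal(size=2)                      # feature vector phi(s) for one state
sigma = 1.0
a = rng.normal(np.dot(theta, phi), sigma)     # sample a ~ N(mu(s), sigma^2)

eps = 1e-6
numeric = np.array([(log_prob(theta + eps * np.eye(2)[k], phi, a, sigma)
                     - log_prob(theta - eps * np.eye(2)[k], phi, a, sigma)) / (2 * eps)
                    for k in range(2)])
print(score(theta, phi, a, sigma), numeric)   # the two gradients should agree closely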
I used the following pseudocode: a simple actor-critic algorithm based on an action-value critic, using a linear value function approximation Qw(s, a) = φ(s, a)ᵀ w.

Critic: updates w by linear TD(0)
Actor: updates θ by the policy gradient

function QAC
    Initialise s, θ
    Sample a ∼ πθ
    for each step do
        Sample reward r = R(s, a); sample transition s′ ∼ P(·|s, a)
        Sample action a′ ∼ πθ(s′, a′)
        δ = r + γ Qw(s′, a′) − Qw(s, a)
        θ = θ + α ∇θ log πθ(s, a) Qw(s, a)
        w ← w + β δ φ(s, a)
        a ← a′, s ← s′
    end for
end function
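For reference, a line-by-line transcription of one step of that pseudocode into Python could look roughly like this. It is only a sketch: env, phi_s, phi_sa and the hyperparameters are placeholder names for a Gaussian policy with linear mean θᵀφ_s(s) and a linear critic Qw(s, a) = φ(s, a)ᵀw, not the names used in my script further down.

import numpy as np

def qac_step(env, s, a, theta, w, phi_s, phi_sa, alpha, beta, gamma, sigma):
    # One step of the action-value actor-critic (QAC) loop above.
    s_next, r, done, info = env.step(a)                              # observe r and s'
    mu_next = np.dot(theta, phi_s(s_next))
    a_next = np.random.normal(mu_next, sigma)                        # a' ~ pi_theta(s', .)
    delta = r + gamma * np.dot(phi_sa(s_next, a_next), w) \
            - np.dot(phi_sa(s, a), w)                                # TD(0) error
    score = (a - np.dot(theta, phi_s(s))) * phi_s(s) / sigma ** 2    # grad log pi_theta(s, a)
    theta = theta + alpha * score * np.dot(phi_sa(s, a), w)          # actor update
    w = w + beta * delta * phi_sa(s, a)                              # critic update
    return s_next, a_next, theta, w, done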
Here is my code (Python):
import numpy as np
import gym
import random
env = gym.make('MountainCarContinuous-v0')
observation = env.reset()
gamma = 0.8
alpha = 0.1
beta = 0.2
theta = np.random.rand(2,1)
w = np.random.rand(3,1)
sigma = 1
## Critic: linear action-value function approximation (policy evaluation)
def Q_approx(state_action, w):
    Q = np.dot(np.transpose(state_action), w)
    return Q

## Actor: mean of the Gaussian policy (policy improvement)
def policy_approx(state, theta):
    action = np.dot(np.transpose(state), theta)
    return action
num_episodes = 20000
num_steps = 300
rt = np.zeros([num_steps,num_episodes])
a_save = np.zeros([num_steps,num_episodes])
state_history = np.zeros([num_steps,num_episodes])
for i in range(num_episodes):
    state = env.reset()
    ## exploration schedule: act according to the policy more often as training progresses
    e = (0.8 / num_episodes) * i + 0.1
    if random.random() > e:
        action = np.array([random.uniform(-2, 2)])
    else:
        action = policy_approx(state, theta)
    for j in range(num_steps):
        ## take action
        new_state, reward, done, info = env.step(action)
        rt[j, i] = reward
        a_save[j, i] = action
        ## sample from the Gaussian policy centred on the deterministic action
        a = np.random.normal(action, sigma)
        if random.random() > e:
            new_action = random.uniform(-2, 2)
        else:
            new_action = policy_approx(state, theta)
        new_action = policy_approx(new_state, theta)
        state_action = [state[0], state[1], action[0]]
        new_state_action = [new_state[0], new_state[1], new_action[0]]
        ## calculate temporal difference
        state_history[j, i] = state[0]
        delta = reward + gamma * Q_approx(new_state_action, w) - Q_approx(state_action, w)
        ## actor update: theta += alpha * grad log pi * Q(s, a)
        theta = [x + y for x, y in zip(theta, alpha * state * ((a - action) / sigma**2) * Q_approx(state_action, w))]
        ## critic update: w += beta * delta * phi(s, a)
        w = [x + y for x, y in zip(w, beta * delta * state_action)]
        state = new_state
        action = new_action
        print(i, j)
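For what it is worth, the two list comprehensions turn theta and w from NumPy arrays into Python lists after the first update. A sketch of the same two updates written with NumPy column vectors (my own helper name actor_critic_update, not part of the script) would be:

import numpy as np

def actor_critic_update(theta, w, state, action, sampled_a, delta, q_sa,
                        alpha=0.1, beta=0.2, sigma=1.0):
    # Keeps theta as a (2, 1) array and w as a (3, 1) array.
    phi_s = np.asarray(state).reshape(2, 1)             # state features
    phi_sa = np.append(state, action).reshape(3, 1)     # [position, velocity, action]
    score = (sampled_a - action) / sigma**2 * phi_s     # grad_theta log pi_theta(s, a)
    theta = theta + alpha * score * q_sa                # actor update
    w = w + beta * delta * phi_sa                       # critic update
    return theta, w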