Implementing a Policy Gradient algorithm in the Actor-Critic framework from David Silver's RL course

Time: 2017-10-22 10:10:09

Tags: python reinforcement-learning

I wrote a simple policy gradient script following the pseudocode in David Silver's RL lecture notes from UCL.

I am using a Gaussian policy because I am solving the continuous MountainCar problem from OpenAI Gym.

If I take gamma high, e.g. 0.9, the system diverges. If I take the learning rates higher, the system also diverges. Is the variance too high? Is my Gaussian policy implementation wrong?

The derivative is ∇θ log πθ(s, a) = (a − μ(s)) φ(s) / σ²

where a ∼ N(μ(s), σ²)
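
For concreteness, a minimal NumPy sketch of that score function (phi_s, theta and sigma here are made-up placeholder values, not the ones used in the script further down):

import numpy as np

def gaussian_score(phi_s, theta, a, sigma):
    """grad_theta log pi_theta(s, a) for a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = np.dot(phi_s, theta)              # mean of the Gaussian policy
    return (a - mu) * phi_s / sigma ** 2   # (a - mu(s)) * phi(s) / sigma^2

# sanity check: the score vanishes when the sampled action equals the mean
phi_s = np.array([0.3, -1.2])              # assumed 2-d state features
theta = np.array([0.5, 0.1])
sigma = 1.0
print(gaussian_score(phi_s, theta, np.dot(phi_s, theta), sigma))  # prints a zero vector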

I used the following pseudocode (a simple actor-critic algorithm based on an action-value critic, using a linear value function approximation Qw(s, a) = φ(s, a)ᵀ w):

Critic: updates w by linear TD(0)
Actor: updates θ by policy gradient

function QAC
    Initialise s, θ
    Sample a ∼ πθ
    for each step do
        Sample reward r = R_s^a; sample transition s′ ∼ P_s^a
        Sample action a′ ∼ πθ(s′, a′)
        δ = r + γ Qw(s′, a′) − Qw(s, a)
        θ = θ + α ∇θ log πθ(s, a) Qw(s, a)
        w ← w + β δ φ(s, a)
        a ← a′, s ← s′
    end for
end function
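
To check my reading of the slide, here is a line-by-line NumPy transcription of one iteration of that loop (the names qac_step, phi_sa and score are mine; phi_sa stands for φ(s, a) and score for ∇θ log πθ(s, a)):

import numpy as np

def qac_step(phi_sa, phi_sa_next, score, r, gamma, alpha, beta, theta, w):
    """One QAC iteration with a linear critic Qw(s, a) = phi(s, a)^T w."""
    q_sa = np.dot(phi_sa, w)              # Qw(s, a)
    q_next = np.dot(phi_sa_next, w)       # Qw(s', a')
    delta = r + gamma * q_next - q_sa     # TD(0) error
    theta = theta + alpha * score * q_sa  # actor: policy-gradient step
    w = w + beta * delta * phi_sa         # critic: linear TD(0) step
    return theta, w

In my script the features φ(s, a) are just [position, velocity, action] and the score comes from the Gaussian formula above.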

Here is my Python code:

import numpy as np
import gym
import random

env = gym.make('MountainCarContinuous-v0')

gamma = 0.8   # discount factor
alpha = 0.1   # actor (policy) learning rate
beta = 0.2    # critic learning rate

theta = np.random.rand(2, 1)  # actor weights: policy mean mu(s) = s^T theta
w = np.random.rand(3, 1)      # critic weights over the (state, action) features
sigma = 1                     # fixed standard deviation of the Gaussian policy

## Critic: linear action-value approximation Qw(s, a) = phi(s, a)^T w
def Q_approx(state_action, w):
    return np.dot(np.transpose(state_action), w)

## Actor: the policy mean is linear in the state, mu(s) = s^T theta
def policy_approx(state, theta):
    return np.dot(np.transpose(state), theta)

num_episodes = 20000
num_steps = 300
rt = np.zeros([num_steps, num_episodes])             # rewards
a_save = np.zeros([num_steps, num_episodes])         # actions taken
state_history = np.zeros([num_steps, num_episodes])  # car positions

for i in range(num_episodes):
    state = env.reset()
    # probability of following the policy mean grows from 0.1 to 0.9
    e = (0.8 / num_episodes) * i + 0.1

    ## generate the first action
    if random.random() > e:
        action = np.array([random.uniform(-2, 2)])   # exploratory action
    else:
        action = policy_approx(state, theta)         # policy mean

    for j in range(num_steps):
        ## take action
        new_state, reward, done, info = env.step(action)
        rt[j, i] = reward
        a_save[j, i] = action
        state_history[j, i] = state[0]

        # perturbed action used in the score function (a - mu(s)) / sigma^2
        a = np.random.normal(action, sigma)

        ## generate the next action
        if random.random() > e:
            new_action = np.array([random.uniform(-2, 2)])
        else:
            new_action = policy_approx(new_state, theta)

        state_action = [state[0], state[1], action[0]]
        new_state_action = [new_state[0], new_state[1], new_action[0]]

        ## temporal-difference error
        delta = reward + gamma * Q_approx(new_state_action, w) - Q_approx(state_action, w)

        ## actor update: theta <- theta + alpha * grad log pi(s, a) * Qw(s, a)
        theta = theta + alpha * Q_approx(state_action, w) * ((a - action) / sigma ** 2) * state.reshape(-1, 1)

        ## critic update: w <- w + beta * delta * phi(s, a)
        w = w + beta * delta * np.array(state_action).reshape(-1, 1)

        state = new_state
        action = new_action
        print(i, j)

        # stop the episode when the environment signals termination
        if done:
            break
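
For what it's worth, the per-episode return can be read off the rt array collected above (a small diagnostic sketch, not part of the algorithm itself):

# diagnostic: total reward per episode, smoothed over a 100-episode window
episode_return = rt.sum(axis=0)
window = 100
smoothed = np.convolve(episode_return, np.ones(window) / window, mode='valid')
print(smoothed[::1000])   # print every 1000th smoothed value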

0 Answers