Implementing a Policy Gradient algorithm in the Actor-Critic framework from David Silver's RL course

Time: 2017-10-22 10:10:09

Tags: python reinforcement-learning

I wrote a simple policy gradient script following the pseudocode in David Silver's RL lecture notes from UCL.

I am using a Gaussian policy because I am solving the continuous MountainCar problem from OpenAI Gym.

If I take gamma high, e.g. 0.9, the system diverges. If I take the learning rates higher, the system also diverges. Is the variance too high? Is my Gaussian policy implementation wrong?

The derivative is ∇θ log πθ(s, a) = (a − μ(s)) φ(s) / σ²

where a ∼ N(μ(s), σ²)
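
For concreteness, a minimal NumPy sketch of that score function (phi_s, theta and sigma here are made-up placeholder values, not the ones used in the script further down):

import numpy as np

def gaussian_score(phi_s, theta, a, sigma):
    """grad_theta log pi_theta(s, a) for a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = np.dot(phi_s, theta)              # mean of the Gaussian policy
    return (a - mu) * phi_s / sigma ** 2   # (a - mu(s)) * phi(s) / sigma^2

# sanity check: the score vanishes when the sampled action equals the mean
phi_s = np.array([0.3, -1.2])              # assumed 2-d state features
theta = np.array([0.5, 0.1])
sigma = 1.0
print(gaussian_score(phi_s, theta, np.dot(phi_s, theta), sigma))  # prints a zero vector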

I used the following pseudocode (a simple actor-critic algorithm based on an action-value critic, using a linear value function approximation Qw(s, a) = φ(s, a)ᵀ w):

Critic: updates w by linear TD(0)
Actor: updates θ by policy gradient

function QAC
    Initialise s, θ
    Sample a ∼ πθ
    for each step do
        Sample reward r = R_s^a; sample transition s′ ∼ P_s^a
        Sample action a′ ∼ πθ(s′, a′)
        δ = r + γ Qw(s′, a′) − Qw(s, a)
        θ = θ + α ∇θ log πθ(s, a) Qw(s, a)
        w ← w + β δ φ(s, a)
        a ← a′, s ← s′
    end for
end function
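
To check my reading of the slide, here is a line-by-line NumPy transcription of one iteration of that loop (the names qac_step, phi_sa and score are mine; phi_sa stands for φ(s, a) and score for ∇θ log πθ(s, a)):

import numpy as np

def qac_step(phi_sa, phi_sa_next, score, r, gamma, alpha, beta, theta, w):
    """One QAC iteration with a linear critic Qw(s, a) = phi(s, a)^T w."""
    q_sa = np.dot(phi_sa, w)              # Qw(s, a)
    q_next = np.dot(phi_sa_next, w)       # Qw(s', a')
    delta = r + gamma * q_next - q_sa     # TD(0) error
    theta = theta + alpha * score * q_sa  # actor: policy-gradient step
    w = w + beta * delta * phi_sa         # critic: linear TD(0) step
    return theta, w

In my script the features φ(s, a) are just [position, velocity, action] and the score comes from the Gaussian formula above.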

Here is my Python code:

import numpy as np
import gym
import random

env = gym.make('MountainCarContinuous-v0')

gamma = 0.8   # discount factor
alpha = 0.1   # actor (policy) learning rate
beta = 0.2    # critic learning rate

theta = np.random.rand(2, 1)  # actor weights: policy mean mu(s) = s^T theta
w = np.random.rand(3, 1)      # critic weights over the (state, action) features
sigma = 1                     # fixed standard deviation of the Gaussian policy

## Critic: linear action-value approximation Qw(s, a) = phi(s, a)^T w
def Q_approx(state_action, w):
    return np.dot(np.transpose(state_action), w)

## Actor: the policy mean is linear in the state, mu(s) = s^T theta
def policy_approx(state, theta):
    return np.dot(np.transpose(state), theta)

num_episodes = 20000
num_steps = 300
rt = np.zeros([num_steps, num_episodes])             # rewards
a_save = np.zeros([num_steps, num_episodes])         # actions taken
state_history = np.zeros([num_steps, num_episodes])  # car positions

for i in range(num_episodes):
    state = env.reset()
    # probability of following the policy mean grows from 0.1 to 0.9
    e = (0.8 / num_episodes) * i + 0.1

    ## generate the first action
    if random.random() > e:
        action = np.array([random.uniform(-2, 2)])   # exploratory action
    else:
        action = policy_approx(state, theta)         # policy mean

    for j in range(num_steps):
        ## take action
        new_state, reward, done, info = env.step(action)
        rt[j, i] = reward
        a_save[j, i] = action
        state_history[j, i] = state[0]

        # perturbed action used in the score function (a - mu(s)) / sigma^2
        a = np.random.normal(action, sigma)

        ## generate the next action
        if random.random() > e:
            new_action = np.array([random.uniform(-2, 2)])
        else:
            new_action = policy_approx(new_state, theta)

        state_action = [state[0], state[1], action[0]]
        new_state_action = [new_state[0], new_state[1], new_action[0]]

        ## temporal-difference error
        delta = reward + gamma * Q_approx(new_state_action, w) - Q_approx(state_action, w)

        ## actor update: theta <- theta + alpha * grad log pi(s, a) * Qw(s, a)
        theta = theta + alpha * Q_approx(state_action, w) * ((a - action) / sigma ** 2) * state.reshape(-1, 1)

        ## critic update: w <- w + beta * delta * phi(s, a)
        w = w + beta * delta * np.array(state_action).reshape(-1, 1)

        state = new_state
        action = new_action
        print(i, j)

        # stop the episode when the environment signals termination
        if done:
            break
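
For what it's worth, the per-episode return can be read off the rt array collected above (a small diagnostic sketch, not part of the algorithm itself):

# diagnostic: total reward per episode, smoothed over a 100-episode window
episode_return = rt.sum(axis=0)
window = 100
smoothed = np.convolve(episode_return, np.ones(window) / window, mode='valid')
print(smoothed[::1000])   # print every 1000th smoothed value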

0 Answers