I have seen multiple implementations of REINFORCE (a.k.a. the vanilla policy gradient algorithm) applied to reinforcement learning tasks with discrete action spaces. Are there implementations of it (or of other policy gradient algorithms) for continuous action spaces?
More specifically, can REINFORCE be implemented for bipedal locomotion, e.g. "Humanoid-v2" from OpenAI Gym?
Thanks.
Answer 0 (score: 0)
You can use the stable-baselines package: https://github.com/hill-a/stable-baselines
Training an agent is straightforward:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
my_env_id = 'Humanoid-v2'
env = gym.make(my_env_id)
# Vectorized environments make it easy to multiprocess training;
# we demonstrate their usefulness in the next examples
env = DummyVecEnv([lambda: env])  # The algorithms require a vectorized environment to run
model = PPO2(MlpPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=10000)
# Enjoy trained agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
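To answer the original question directly: yes, REINFORCE does extend to continuous action spaces. Instead of a categorical distribution over discrete actions, the policy outputs the parameters of a continuous distribution (typically a Gaussian), actions are sampled from it, and the same log-probability gradient update applies. Below is a minimal NumPy sketch on a hypothetical one-step task with a single scalar action, not Humanoid-v2; the reward function, target value 3.0, learning rate, and episode count are all illustrative assumptions, not anything from stable-baselines.

```python
# Minimal REINFORCE sketch with a Gaussian policy on a toy
# continuous-action task (hypothetical): the agent samples a scalar
# action a ~ N(mu, sigma^2) and gets reward -(a - 3.0)**2, so the
# optimal policy mean is 3.0.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.02  # policy mean, fixed std, learning rate
target = 3.0

for episode in range(2000):
    a = rng.normal(mu, sigma)       # sample an action from the policy
    reward = -(a - target) ** 2     # one-step episode return
    # gradient of log N(a | mu, sigma) with respect to mu
    grad_log_pi = (a - mu) / sigma ** 2
    mu += lr * reward * grad_log_pi  # REINFORCE update

print(f"learned mean action: {mu:.2f}")  # should end near the optimal 3.0
```

In a full implementation (e.g. for Humanoid-v2) the mean and standard deviation would come from a neural network conditioned on the observation, the return would be accumulated over a multi-step trajectory, and a baseline would usually be subtracted to reduce the variance of the gradient estimate.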