Question

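I'm trying to solve a basic problem: getting an agent to reach a fixed target point in the R^2 plane.

To do this, I'm using the RLlib library, which provides an implementation of the PPO algorithm.

Here is my basic code: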

import numpy as np
import gym
from gym.spaces import Box

import ray
from ray import tune


class A2BEnv(gym.Env):
    def __init__(self, env_config):
        # goal/destination position for agent
        self.desired_state = np.array([0.0,0.0])
        # agent can move by up to 1 unit in any direction
        self.action_space = Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # agent's state is its position in R^2
        self.observation_space = Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)


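    def step(self, action):
        # agent moves; build a new array rather than mutating in place,
        # so observations returned earlier are not altered
        self.state = self.state + action
        reward = -np.sqrt(np.linalg.norm(self.state - self.desired_state))
        # episode ends once the agent is close enough to the goal
        # (assignment, not the comparison `==`)
        done = bool(reward > -0.3)

        return (self.state, reward, done, {})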

    def reset(self):
        # agent's initial position (dtype matches observation_space)
        self.state = np.array([3.0, -3.0], dtype=np.float32)
        return self.state


if __name__ == "__main__":
    ray.init()
    tune.run(
        "PPO",
        config={
            "env": A2BEnv,
            "horizon": 200,
            "num_workers": 3,
            "lr": 1e-4,
            "gamma": 0.99,
        },
    )

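I have tried many variations, such as bounding the spaces, making the observation/action spaces discrete, and trying different rewards (e.g. -1 per step and 0 at the goal position), but I can't get it to work. During training I often see the optimal maximum reward (i.e. the agent finds the best path to the destination), but the mean reward does not improve, even after thousands of iterations.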
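
For reference, the return of the best possible episode can be worked out without RLlib. The sketch below replays the dynamics of A2BEnv in plain Python (no gym or ray), assuming the termination check is meant to be `done = reward > -0.3`. The helper names `reward_fn`, `greedy_action` and `rollout` are illustrative only and are not part of RLlib.

```python
import math

GOAL = (0.0, 0.0)
START = (3.0, -3.0)


def reward_fn(state):
    # Same reward as A2BEnv.step: negative square root of the Euclidean distance.
    return -math.sqrt(math.hypot(state[0] - GOAL[0], state[1] - GOAL[1]))


def greedy_action(state):
    # Hypothetical hand-written policy: move each coordinate toward the goal,
    # clipped to the action bounds [-1, 1].
    return tuple(max(-1.0, min(1.0, g - s)) for s, g in zip(state, GOAL))


def rollout(horizon=200):
    # Replay one episode and return (number of steps, summed reward).
    state = START
    total = 0.0
    steps = 0
    for _ in range(horizon):
        action = greedy_action(state)
        state = (state[0] + action[0], state[1] + action[1])
        reward = reward_fn(state)
        total += reward
        steps += 1
        if reward > -0.3:  # assumed termination rule
            break
    return steps, total


if __name__ == "__main__":
    steps, total = rollout()
    print(steps, round(total, 3))
```

Under these assumptions the greedy policy reaches the goal in 3 steps with a return of about -2.87, which gives a baseline to compare the reported maximum and mean episode rewards against.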

I believe this is a simple problem, but I can't find a solution. Is PPO a poor fit for this kind of problem? Or should the reward be different?
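
To make the reward question concrete, the two signals mentioned above can be compared side by side. This is a minimal sketch in plain Python. The function names and the `goal_radius` parameter are hypothetical, and the radius 0.09 is chosen only because `-sqrt(d) > -0.3` is equivalent to `d < 0.09`.

```python
import math


def distance(state, goal=(0.0, 0.0)):
    # Euclidean distance from the agent's position to the goal.
    return math.hypot(state[0] - goal[0], state[1] - goal[1])


def shaped_reward(state):
    # Reward from the posted environment: -sqrt(distance to goal).
    return -math.sqrt(distance(state))


def sparse_reward(state, goal_radius=0.09):
    # The alternative described in the question: -1 per step, 0 at the goal.
    return 0.0 if distance(state) < goal_radius else -1.0


if __name__ == "__main__":
    for state in [(3.0, -3.0), (1.0, -1.0), (0.05, 0.0)]:
        print(state, round(shaped_reward(state), 3), sparse_reward(state))
```

The shaped variant changes with every position, while the sparse variant only distinguishes "at the goal" from "not at the goal".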

Solving an A-to-B problem with PPO x RLlib doesn't work
