How do you determine when the CartPole environment has been solved?

Date: 2019-02-17 21:44:55

Tags: machine-learning pytorch reinforcement-learning openai-gym

I am reading this tutorial and came across the following code:

        # Calculate score to determine when the environment has been solved
        scores.append(time)
        mean_score = np.mean(scores[-100:])

        if episode % 50 == 0:
            print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
                episode, mean_score))

        if mean_score > env.spec.reward_threshold:
            print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
                  .format(episode, mean_score, time))
            break

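But this doesn't really make sense to me. How does one define when an "RL environment has been solved"? I'm not sure what that even means. In classification, I suppose it would make sense to define it as the loss reaching zero. In regression, maybe as the total L2 loss falling below some value? Here it might make sense to define it as the expected return (discounted rewards) exceeding some value.

But here they seem to be counting time steps? That makes no sense to me.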


Note that the original tutorial did it this way:

def main(episodes):
    running_reward = 10
    for episode in range(episodes):
        state = env.reset() # Reset environment and record the starting state
        done = False       

        for time in range(1000):
            action = select_action(state)
            # Step through environment using chosen action
            state, reward, done, _ = env.step(action.data[0])
            # Save reward
            policy.reward_episode.append(reward)
            if done:
                break

        # Used to determine when the environment is solved.
        running_reward = (running_reward * 0.99) + (time * 0.01)

        update_policy()

        if episode % 50 == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))

        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
            break

I'm not sure this makes any more sense...

Is this just a peculiarity of this environment/task? How do tasks normally end?

2 answers:

Answer 0: (score: 1)

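In CartPole, the time used equals the reward of the episode. The longer you balance the pole, the higher the score, up to some maximum time value.

So if the running average over the last episodes is close enough to that maximum time, the task is considered solved.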
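
To make the link between time steps and reward concrete, here is a minimal sketch that mirrors the loop structure from the question. It does not use Gym: `make_toy_step` is a hypothetical stand-in for `env.step` that simply hands out a reward of 1.0 per step, which is how CartPole rewards the agent. Note that with a zero-based loop counter, the total reward is `time + 1` rather than `time`, so the tutorial's score is off by one from the actual return.

```python
def run_episode(step_fn, max_steps=1000):
    """Run one episode and return (last_time_index, total_reward)."""
    total_reward = 0.0
    time = 0
    for time in range(max_steps):
        reward, done = step_fn(time)
        total_reward += reward
        if done:
            break
    return time, total_reward


def make_toy_step(fail_at):
    """Hypothetical stand-in for env.step (not a Gym API).

    Gives +1 reward on every step and reports done on step number `fail_at`.
    """
    def step_fn(t):
        return 1.0, (t + 1) >= fail_at
    return step_fn


# The pole "falls" on the 42nd step of this toy episode.
last_time, episode_return = run_episode(make_toy_step(fail_at=42))
print(last_time, episode_return)
```

Because every surviving step is worth the same reward, tracking episode length and tracking undiscounted return amount to the same measurement in this environment.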

Answer 1: (score: 0)

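"Is this just a peculiarity of this environment/task?"

Yes. Episode termination depends entirely on the respective environment.

The CartPole challenge is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.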
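
The criterion above can be sketched as a small tracker. This is an illustration under stated assumptions, not the tutorial's code: `SolvedTracker` is a hypothetical helper, the default threshold of 195.0 is taken from this answer (the question's code reads it from `env.spec.reward_threshold` instead), and unlike `np.mean(scores[-100:])` it refuses to report success until a full 100 episodes have been seen.

```python
from collections import deque


class SolvedTracker:
    """Hypothetical helper: tracks returns over a fixed window of episodes."""

    def __init__(self, threshold=195.0, window=100):
        self.threshold = threshold  # assumed value, taken from the answer text
        self.window = window
        self.returns = deque(maxlen=window)  # old episodes drop out automatically

    def update(self, episode_return):
        """Record one episode; return True once the windowed mean meets the threshold."""
        self.returns.append(episode_return)
        if len(self.returns) < self.window:
            return False  # not enough consecutive trials yet
        return sum(self.returns) / self.window >= self.threshold


tracker = SolvedTracker()
results = [tracker.update(200.0) for _ in range(100)]
print(results[98], results[99])
```

Usage inside a training loop would be a single call per episode, for example `if tracker.update(total_reward): break`.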

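The performance of your solution is measured by how quickly your algorithm solves the problem.

For more information on the CartPole env, see this wiki.

For information on any Gym environment, see this wiki.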