Reinforcement learning with simple averaging

Posted: 2019-09-20 17:59:07

Tags: python reinforcement-learning

I am currently reading Reinforcement Learning: An Introduction (RL:AI) and trying to reproduce the first example, the n-armed bandit with simple reward averaging.

Averaging

new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
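For context, this incremental form should be equivalent to the plain sample mean; a minimal check (with a hypothetical reward stream for a single action) looks like this:

    import numpy as np

    rewards = np.random.normal(0, 1, size=100)  # hypothetical reward samples

    estimate = 0.0
    for step, reward in enumerate(rewards, start=1):
        estimate = estimate + 1.0 / step * (reward - estimate)

    # With step = number of samples seen so far, the incremental update is exactly the sample mean.
    assert np.isclose(estimate, rewards.mean())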

To reproduce the figure from the PDF, I generate 2000 bandit tasks, let the different agents play each bandit for 1000 steps (as described in the PDF), and then average the rewards as well as the percentage of optimal actions.

In the PDF, the results look like this:

[figure from the book: average reward and % optimal action over steps]

However, I cannot reproduce this. If I use simple averaging, all agents with exploration (epsilon > 0) actually perform worse than the agent without exploration. This is strange, because the possibility to explore should let an agent leave local optima more often and reach better actions.

As you can see below, this is not the case for my implementation. Note also that I added agents that use weighted averaging. These work, but even there, increasing epsilon degrades the agents' performance.

[figure: my results (average reward and % optimal action over steps)]

Any ideas what is wrong with my code?

Code (MVP)

from abc import ABC
from typing import List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from multiprocessing.pool import Pool


class Strategy(ABC):

    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        raise NotImplementedError()


class Averaging(Strategy):

    def __str__(self):
        return 'avg'

    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        current = estimates[action]
        return current + 1.0 / step * (reward - current)


class WeightedAveraging(Strategy):

    def __init__(self, alpha):
        self.alpha = alpha

    def __str__(self):
        return 'weighted-avg_alpha=%.2f' % self.alpha

    def update_estimates(self, step: int, estimates: List[float], action: int, reward: float):
        current = estimates[action]
        return current + self.alpha * (reward - current)


class Agent:

    def __init__(self, nb_actions, epsilon, strategy: Strategy):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.estimates = np.zeros(self.nb_actions)
        self.strategy = strategy

    def __str__(self):
        return ','.join(['eps=%.2f' % self.epsilon, str(self.strategy)])

    def get_action(self):
        best_known = np.argmax(self.estimates)
        if np.random.rand() < self.epsilon and len(self.estimates) > 1:
            explore = best_known
            while explore == best_known:
                explore = np.random.randint(0, len(self.estimates))
            return explore
        return best_known

    def update_estimates(self, step, action, reward):
        self.estimates[action] = self.strategy.update_estimates(step, self.estimates, action, reward)

    def reset(self):
        self.estimates = np.zeros(self.nb_actions)


def play_bandit(agent, nb_arms, nb_steps):

    agent.reset()

    bandit_rewards = np.random.normal(0, 1, nb_arms)

    rewards = list()
    optimal_actions = list()

    for step in range(1, nb_steps + 1):

        action = agent.get_action()
        reward = bandit_rewards[action] + np.random.normal(0, 1)
        agent.update_estimates(step, action, reward)

        rewards.append(reward)
        optimal_actions.append(np.argmax(bandit_rewards) == action)

    return pd.DataFrame(dict(
        optimal_actions=optimal_actions,
        rewards=rewards
    ))


def main():
    nb_tasks = 2000
    nb_steps = 1000
    nb_arms = 10

    fig, (ax_rewards, ax_optimal) = plt.subplots(2, 1, sharex='col', figsize=(8, 9))

    pool = Pool()

    agents = [
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=WeightedAveraging(0.5)),
    ]

    for agent in agents:

        print('Agent: %s' % str(agent))

        args = [(agent, nb_arms, nb_steps) for _ in range(nb_tasks)]
        results = pool.starmap(play_bandit, args)

        df_result = sum(results) / nb_tasks
        df_result.rewards.plot(ax=ax_rewards, label=str(agent))
        df_result.optimal_actions.plot(ax=ax_optimal)

    ax_rewards.set_title('Rewards')
    ax_rewards.set_ylabel('Average reward')
    ax_rewards.legend()
    ax_optimal.set_title('Optimal action')
    ax_optimal.set_ylabel('% optimal action')
    ax_optimal.set_xlabel('steps')
    plt.xlim([0, nb_steps])
    plt.show()


if __name__ == '__main__':
    main()

1 Answer:

Answer (score: 4)

In the formula for the update rule

new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)

the parameter step should be the number of times that the particular action has been taken, not the total number of simulation steps. You therefore need to store that count alongside the action values so it can be used in the update.
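A minimal sketch of such a change, keeping the Strategy interface from the question (the counter name action_counts and the reset hook are illustrative, not from the original code):

    import numpy as np


    class Averaging(Strategy):
        """Sample-average strategy that tracks how often each action was taken."""

        def __init__(self, nb_actions: int):
            self.action_counts = np.zeros(nb_actions, dtype=int)

        def __str__(self):
            return 'avg'

        def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
            # Use the per-action count, not the global step number.
            self.action_counts[action] += 1
            current = estimates[action]
            return current + 1.0 / self.action_counts[action] * (reward - current)

        def reset(self):
            # Should also be called from Agent.reset() so counts start fresh for each bandit task.
            self.action_counts[:] = 0

With something like this, the constructors in main() would become Averaging(nb_actions=nb_arms), and Agent.reset() would additionally call the strategy's reset.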

This can also be seen in the pseudocode box at the end of Section 2.4, "Incremental Implementation":

[screenshot: pseudocode box from Section 2.4]

(Source: Richard S. Sutton and Andrew G. Barto: Reinforcement Learning - An Introduction, second edition, 2018, Section 2.4 Incremental Implementation)
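For reference, a rough Python transcription of that pseudocode box (the bandit callable and variable names here are illustrative, not taken verbatim from the book):

    import numpy as np


    def simple_bandit(k: int, epsilon: float, nb_steps: int, bandit):
        """Incremental epsilon-greedy bandit, roughly following Section 2.4.

        `bandit(a)` is assumed to return a sampled reward for arm `a`.
        """
        Q = np.zeros(k)              # action-value estimates
        N = np.zeros(k, dtype=int)   # per-action counts
        for _ in range(nb_steps):
            if np.random.rand() < epsilon:
                A = np.random.randint(k)   # explore: random arm
            else:
                A = int(np.argmax(Q))      # exploit: greedy arm
            R = bandit(A)
            N[A] += 1
            Q[A] += (R - Q[A]) / N[A]      # sample-average update with per-action count
        return Q, N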