Question

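A common term in finance and reinforcement learning is the discounted cumulative reward C[i] based on a time series of raw rewards R[i]. Given an array R, we would like to compute C[i] satisfying the recurrence C[i] = R[i] + discount * C[i+1] with C[-1] = R[-1] (and return the full array C).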

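A numerically stable way to compute this in Python with numpy arrays might be:

import numpy as np

def cumulative_discount(rewards, discount):
    future_cumulative_reward = 0
    assert np.issubdtype(rewards.dtype, np.floating), rewards.dtype
    cumulative_rewards = np.empty_like(rewards)
    # Walk backwards so each step reuses the value computed for i + 1.
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards

However, this relies on a Python loop. Given that this is such a common calculation, surely there is an existing vectorized solution that relies on some other standard functions, without resorting to Cythonization.

Note that any solution built on something like explicitly raising discount to a range of powers will not be stable.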
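To make the stability concern concrete, here is a minimal sketch comparing the backward loop with a naive power-based vectorization. The names loop_discount and power_discount are hypothetical, and the power-based formula is only one way such a solution might be written, not something taken from the question.

```python
import numpy as np

def loop_discount(rewards, discount):
    # Reference: the backward recurrence C[i] = R[i] + discount * C[i+1].
    out = np.empty(len(rewards), dtype=np.float64)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + discount * running
        out[i] = running
    return out

def power_discount(rewards, discount):
    # Hypothetical naive vectorization:
    # C[i] = (sum over j >= i of R[j] * discount**j) / discount**i
    powers = discount ** np.arange(len(rewards))
    suffix_sums = np.cumsum((rewards * powers)[::-1])[::-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        return suffix_sums / powers

rewards = np.ones(10000)
discount = 0.9

# On a short array both approaches agree.
short_ok = np.allclose(power_discount(rewards[:50], discount),
                       loop_discount(rewards[:50], discount))

# On a long array discount**j underflows to 0, so the division yields nan.
long_ok = bool(np.all(np.isfinite(power_discount(rewards, discount))))

print(short_ok, long_ok)
```

With discount < 1 the large powers underflow to zero, and with discount > 1 they can overflow, so the power-based form breaks down well before the loop does.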

Answer 1

You can use scipy.signal.lfilter to solve the recurrence relation:

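import scipy.signal as signal

def alt(rewards, discount):
    """
    C[i] = R[i] + discount * C[i+1]
    signal.lfilter(b, a, x, axis=-1, zi=None)
    a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
                          - a[1]*y[n-1] - ... - a[N]*y[n-N]
    """
    # Reverse the input so the backward recurrence becomes the forward
    # filter y[n] = x[n] + discount * y[n-1].
    r = rewards[::-1]
    a = [1, -discount]
    b = [1]
    y = signal.lfilter(b, a, x=r)
    # Reverse again to restore the original time order.
    return y[::-1]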

This script tests that the result is the same:

import scipy.signal as signal
import numpy as np

def orig(rewards, discount):
    future_cumulative_reward = 0
    cumulative_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards

def alt(rewards, discount):
    """
    C[i] = R[i] + discount * C[i+1]
    signal.lfilter(b, a, x, axis=-1, zi=None)
    a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
                          - a[1]*y[n-1] - ... - a[N]*y[n-N]
    """
    r = rewards[::-1]
    a = [1, -discount]
    b = [1]
    y = signal.lfilter(b, a, x=r)
    return y[::-1]

# test that the result is the same
np.random.seed(2017)

for i in range(100):
    rewards = np.random.random(10000)
    discount = 1.01
    expected = orig(rewards, discount)
    result = alt(rewards, discount)
    if not np.allclose(expected,result):
        print('FAIL: {}({}, {})'.format('alt', rewards, discount))
        break
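Since lfilter takes an axis argument, the same idea can plausibly be applied to several reward sequences at once. The following is a small sketch under that assumption; discount_batch is a hypothetical helper name, and the input is assumed to have shape (episodes, steps).

```python
import numpy as np
from scipy import signal

def discount_batch(rewards, discount):
    # rewards: array of shape (episodes, steps); hypothetical helper.
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse time so the backward recurrence becomes a forward filter.
    reversed_rewards = np.ascontiguousarray(rewards[..., ::-1])
    y = signal.lfilter([1.0], [1.0, -discount], reversed_rewards, axis=-1)
    # Reverse again to restore the original time order.
    return y[..., ::-1]

batch = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 4.0]])
result = discount_batch(batch, 0.5)
# First row by hand: C[2] = 3, C[1] = 2 + 0.5 * 3 = 3.5, C[0] = 1 + 0.5 * 3.5 = 2.75
print(result)
```

Each row is filtered independently, so this avoids a Python loop over episodes as well as over time steps.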

Answer 2

The computation you describe is known as Horner's rule, or Horner's method of evaluating polynomials. It is implemented in NumPy's polynomial.polyval.

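However, you want the whole cumulative_rewards list, i.e., all the intermediate steps of Horner's rule. The NumPy method does not return those intermediate values. Your function, decorated with Numba's @jit, could be the best option.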
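As a small illustration of this point, the sketch below evaluates the rewards as polynomial coefficients with polyval, which yields only the first cumulative value, and then spells out Horner's rule by hand to expose the intermediate steps. The helper horner_steps is a hypothetical name introduced for this example.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval

rewards = np.array([1.0, 2.0, 3.0])
discount = 0.5

# polyval treats rewards as coefficients c[0] + c[1]*x + c[2]*x**2,
# so it returns C[0] only: 1 + 2*0.5 + 3*0.25 = 2.75
first_value = polyval(discount, rewards)

def horner_steps(coeffs, x):
    # Horner's rule, keeping every intermediate accumulator value.
    acc = 0.0
    steps = []
    for c in coeffs[::-1]:
        acc = c + x * acc
        steps.append(acc)
    # Reverse so that steps[i] lines up with C[i].
    return steps[::-1]

all_values = horner_steps(rewards, discount)  # [2.75, 3.5, 3.0]
print(first_value, all_values)
```

The intermediate accumulator values are exactly the discounted cumulative rewards, which is why the plain loop is already Horner's rule in disguise.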

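As a theoretical possibility, I will point out that polyval can return the whole list if it is given a Hankel matrix of coefficients. This is vectorized but ultimately less efficient than the Python loop, because each value of cumulative_reward is computed from scratch, independently of the others.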

import numpy as np
from numpy.polynomial.polynomial import polyval
from scipy.linalg import hankel

rewards = np.random.uniform(10, 100, size=(100,))
discount = 0.9
print(polyval(discount, hankel(rewards)))

This matches the output of

print(cumulative_discount(rewards, discount))

Answer 3

If you want a numpy-only solution, consider this one (borrowing the structure from unutbu's answer):

def alt2(rewards, discount):
    tmp = np.arange(rewards.size)
    tmp = tmp - tmp[:, np.newaxis]
    w = np.triu(discount ** tmp.clip(min=0)).T
    return (rewards.reshape(-1, 1) * w).sum(axis=0)

The following script checks it against the loop version:

import numpy as np

def orig(rewards, discount):
    future_cumulative_reward = 0
    cumulative_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards

def alt2(rewards, discount):
    tmp = np.arange(rewards.size)
    tmp = tmp - tmp[:, np.newaxis]
    w = np.triu(discount ** tmp.clip(min=0)).T
    return (rewards.reshape(-1, 1) * w).sum(axis=0)

# test that the result is the same
np.random.seed(2017)

for i in range(100):
    rewards = np.random.random(100)
    discount = 1.01
    expected = orig(rewards, discount)
    result = alt2(rewards, discount)
    if not np.allclose(expected,result):
        print('FAIL: {}({}, {})'.format('alt2', rewards, discount))
        break
else:
    print('success')
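To see what alt2 builds internally, it may help to run the same steps on a three-element input. The values below are illustrative only and were chosen so the result can be checked by hand.

```python
import numpy as np

discount = 0.5
rewards = np.array([1.0, 2.0, 3.0])

idx = np.arange(rewards.size)
diff = idx - idx[:, np.newaxis]              # diff[i, j] = j - i
w = np.triu(discount ** diff.clip(min=0)).T  # w[j, i] = discount**(j - i) for j >= i
# w is:
# [[1.  , 0. , 0. ],
#  [0.5 , 1. , 0. ],
#  [0.25, 0.5, 1. ]]

# Column i of w holds the weights applied to rewards[i:], so summing
# rewards[j] * w[j, i] over j gives C[i].
result = (rewards.reshape(-1, 1) * w).sum(axis=0)
print(result)  # [2.75 3.5  3.  ]
```

The weight matrix has n * n entries, which is where the memory cost for long reward arrays comes from.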

However, this solution does not scale well to large reward arrays, though you can still work around that using stride tricks, as pointed out here.
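One possible way to limit the memory cost, sketched here as an assumption and not as the stride-tricks approach referred to above, is to apply the triangular weight matrix block by block and carry the cumulative value across block boundaries. The function blocked_discount and its block parameter are hypothetical.

```python
import numpy as np

def blocked_discount(rewards, discount, block=256):
    # Hypothetical sketch: triangular weights within a block,
    # plus a carried value from the block to its right.
    rewards = np.asarray(rewards, dtype=np.float64)
    n = rewards.size
    out = np.empty(n)
    k = np.arange(block)
    # w[k, j] = discount**(j - k) for j >= k, else 0
    w = np.triu(discount ** (k - k[:, None]).clip(min=0))
    # tail[k] = discount**(block - k): weight of the carry at local position k
    tail = discount ** (block - k)
    carry = 0.0
    # Process blocks from the end of the array towards the start.
    for stop in range(n, 0, -block):
        start = max(stop - block, 0)
        m = stop - start  # the leftmost block may be shorter
        out[start:stop] = w[:m, :m] @ rewards[start:stop] + tail[block - m:] * carry
        carry = out[start]
    return out

rng = np.random.default_rng(0)
sample = rng.random(1000)
values = blocked_discount(sample, 0.99, block=64)
print(values[:3])
```

This keeps the weight matrix at block * block entries and only raises discount to powers up to block, at the cost of a short Python loop over blocks.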

Vectorize a numpy discount calculation
