Fast Python/NumPy frequency-severity distribution simulation

Posted: 2016-10-04 19:38:00

标签: python performance numpy vectorization

I am looking for a way to simulate the classic frequency-severity distribution, i.e. X = sum(Y_i, i = 1..N), where the claim count N is, for example, Poisson distributed and the severities Y_i are lognormally distributed.

A simple, naive numpy script would be:

import numpy as np
SIM_NUM = 3

X = []
for _i in range(SIM_NUM):
    nr_claims = np.random.poisson(1)
    temp = []
    for _j in range(nr_claims):
        temp.append(np.random.lognormal(0, 1))
    X.append(sum(temp))

Now I am trying to vectorize this for better performance. The following is already somewhat better:

N = np.random.poisson(1, SIM_NUM)
X = []
for n in N:
    X.append(sum(np.random.lognormal(0, 1, n)))

My question is whether I can vectorize the second loop as well, e.g. by simulating all losses in one go:

N = np.random.poisson(1, SIM_NUM)
# print(N) would lead to something like [1 3 0]
losses = np.random.lognormal(0, 1, sum(N))
# print(losses) would lead to something like 
#[ 0.56750244  0.84161871  0.41567216  1.04311742]

# X should now be [ 0.56750244, 0.84161871 + 0.41567216 + 1.04311742, 0] 

My ideas are "smart slicing" or a matrix multiplication with A = [[1,0,0,0],[0,1,1,1],[0,0,0,0]], but I cannot turn these ideas into clean code.
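
For illustration, here is a minimal sketch of that indicator-matrix idea (all variable names here are assumed): build A so that A[i, j] is 1 exactly when loss j belongs to simulation i, then X = A @ losses. Note that A has SIM_NUM x N.sum() entries, so this clarifies the idea rather than being a fast solution:

import numpy as np

SIM_NUM = 3
N = np.random.poisson(1, SIM_NUM)            # e.g. [1 3 0]
losses = np.random.lognormal(0, 1, N.sum())

# A[i, j] == 1 iff loss j falls into simulation i's block of losses
ends = np.cumsum(N)                          # one-past-the-end index per simulation
starts = ends - N
j = np.arange(N.sum())
A = ((starts[:, None] <= j) & (j < ends[:, None])).astype(losses.dtype)

X = A @ losses                               # row-wise sums of the assigned losses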

I am looking for the fastest way to compute X.

2 answers:

Answer 0 (score: 3):

We can use np.bincount, which is very efficient for such interval/ID-based summing operations, especially when working with 1D arrays. The implementation would look like this -

# Generate all Poisson-distributed claim counts in one go
pv = np.random.poisson(1, SIM_NUM)

# Use the Poisson counts to get the total number of lognormal samples needed,
# and generate all of those random numbers in one vectorized call
rand_arr = np.random.lognormal(0, 1, pv.sum())

# Finally create IDs using pv as extents, for use with bincount to do
# ID-based and thus effectively interval-based summing
out = np.bincount(np.arange(pv.size).repeat(pv), rand_arr, minlength=SIM_NUM)
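
For intuition, a tiny worked example of the ID construction (the pv values here are assumed for illustration):

pv = np.array([1, 3, 0])
ids = np.arange(pv.size).repeat(pv)   # -> [0 1 1 1]: loss j belongs to simulation ids[j]
# bincount(ids, rand_arr, minlength=3) then sums the weights per ID; minlength
# keeps trailing simulations with zero claims (like index 2 here) in the output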

Runtime test -

Function definitions:

def original_app1(SIM_NUM):
    X = []
    for _i in range(SIM_NUM):
        nr_claims = np.random.poisson(1)
        temp = []
        for _j in range(nr_claims):
            temp.append(np.random.lognormal(0, 1))
        X.append(sum(temp))
    return X

def original_app2(SIM_NUM):
    N = np.random.poisson(1, SIM_NUM)
    X = []
    for n in N:
        X.append(sum(np.random.lognormal(0, 1, n)))
    return X

def vectorized_app1(SIM_NUM):
    pv = np.random.poisson(1, SIM_NUM)
    r = np.random.lognormal(0, 1, pv.sum())
    return np.bincount(np.arange(pv.size).repeat(pv), r, minlength=SIM_NUM)

Timings on large datasets:

In [199]: SIM_NUM = 1000

In [200]: %timeit original_app1(SIM_NUM)
100 loops, best of 3: 2.6 ms per loop

In [201]: %timeit original_app2(SIM_NUM)
100 loops, best of 3: 6.65 ms per loop

In [202]: %timeit vectorized_app1(SIM_NUM)
1000 loops, best of 3: 252 µs per loop

In [203]: SIM_NUM = 10000

In [204]: %timeit original_app1(SIM_NUM)
10 loops, best of 3: 26.1 ms per loop

In [205]: %timeit original_app2(SIM_NUM)
10 loops, best of 3: 77.5 ms per loop

In [206]: %timeit vectorized_app1(SIM_NUM)
100 loops, best of 3: 2.46 ms per loop

So we are looking at a 10x+ speedup there.

Answer 1 (score: 1):

You are looking for numpy.add.reduceat:

N = np.random.poisson(1, SIM_NUM)
losses = np.random.lognormal(0, 1, np.sum(N))

x = np.zeros(SIM_NUM)
offsets = np.r_[0, np.cumsum(N[N > 0])]
x[N > 0] = np.add.reduceat(losses, offsets[:-1])

The n == 0 case has to be handled separately because of how reduceat works. Also, be sure to use numpy.sum on arrays rather than the much slower built-in sum.
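
For intuition, a small worked example of that zero-count handling (the N and losses values here are assumed for illustration):

import numpy as np

N = np.array([1, 3, 0])
losses = np.array([1.0, 2.0, 3.0, 4.0])   # stand-in losses

x = np.zeros(N.size)
offsets = np.r_[0, np.cumsum(N[N > 0])]   # [0 1 4]: boundaries of the nonzero groups only
x[N > 0] = np.add.reduceat(losses, offsets[:-1])
# x -> [1. 9. 0.]; without the N > 0 mask, two equal adjacent offsets would make
# reduceat return a single element instead of 0 for the empty group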

Whether this is faster than the other answer depends on the mean of the Poisson distribution.