给定一个表示概率的值矩阵,我试图编写一个返回值所属的bin的有效进程。例如:
sample = 0.5
x = np.array([0.1]*10)
np.digitize( sample, np.cumsum(x))-1
#returns 5
是我要找的结果。
根据{{1}}对于timeit
数组很少的元素,它的效率更高:
x
而对于更大的cdf = 0
for key,val in enumerate(x):
cdf += val
if sample<=cdf:
print key
break
数组,numpy解决方案更快。
问题是:
x
是列表的情况对其进行矢量化,其中每个项目与其自己的sample
数组相关联(x
将是2-D)? 在应用程序x
中包含边际概率;这是我需要减少x
答案 0 :(得分:2)
你可以在那里使用一些broadcasting
魔法 -
(x.cumsum(1) > sample[:,None]).argmax(1)-1
涉及的步骤:
予。沿着每一行执行cumsum。
II。对每个样本值使用每个cumsum行的广播比较,并查找第一次出现的样本小于cumsum值,表明x
之前的元素是我们正在寻找的索引。
分步运行 -
In [64]: x
Out[64]:
array([[ 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 ],
[ 0.8 , 0.96, 0.88, 0.36, 0.5 , 0.68, 0.71],
[ 0.37, 0.56, 0.5 , 0.01, 0.77, 0.88, 0.36],
[ 0.62, 0.08, 0.37, 0.93, 0.65, 0.4 , 0.79]])
In [65]: sample # one elem per row of x
Out[65]: array([ 0.5, 2.2, 1.9, 2.2])
In [78]: x.cumsum(1)
Out[78]:
array([[ 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 0.7 ],
[ 0.8 , 1.76, 2.64, 2.99, 3.49, 4.18, 4.89],
[ 0.37, 0.93, 1.43, 1.45, 2.22, 3.1 , 3.47],
[ 0.62, 0.69, 1.06, 1.99, 2.64, 3.04, 3.83]])
In [79]: x.cumsum(1) > sample[:,None]
Out[79]:
array([[False, False, False, False, False, True, True],
[False, False, True, True, True, True, True],
[False, False, False, False, True, True, True],
[False, False, False, False, True, True, True]], dtype=bool)
In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[80]: array([4, 1, 3, 3])
# A loopy solution to verify results against
In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[81]: [4, 1, 3, 3]
边界案例:
建议的解决方案自动处理sample
值小于累计求和值最小值的情况 -
In [113]: sample[0] = 0.08 # editing first sample to be lesser than 0.1
In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[114]: [-1, 1, 3, 3]
In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[115]: array([-1, 1, 3, 3])
对于sample
值大于累计求和值的最大值的情况,我们需要一个额外的步骤 -
In [116]: sample[0] = 0.8 # editing first sample to be greater than 0.7
In [121]: mask = (x.cumsum(1) > sample[:,None])
In [122]: idx = mask.argmax(1)-1
In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
Out[123]: array([6, 1, 3, 3])
In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[124]: [6, 1, 3, 3]