Question

I am looking for a fast NumPy equivalent of MATLAB's accumarray. accumarray accumulates the elements of an array that belong to the same index. An example:

a = np.arange(1,11)
# array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
accmap = np.array([0,1,0,0,0,1,1,2,2,1])

The result should be

array([13, 25, 17])

What I have done so far: I tried the accum function from the recipe here, which works correctly but is slow.

accmap = np.repeat(np.arange(1000), 20)
a = np.random.randn(accmap.size)
%timeit accum(accmap, a, np.sum)
# 1 loops, best of 3: 293 ms per loop

I then tried the solution here, which is supposed to be faster, but it does not work correctly:

accum_np(accmap, a)
# array([  1.,   2.,  12.,  13.,  17.,  10.])

Is there a built-in NumPy function that can accumulate like this? Or any other recommendations?

Answer 1

Use np.bincount with the optional weights argument. In your example you would do:

np.bincount(accmap, weights=a)
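For reference, here is that call applied to the data from the question, as a minimal sketch. Note that with weights the result comes back as floats. The minlength argument is not mentioned in the answer; it is shown only as an optional extra for padding groups that never occur.

```python
import numpy as np

a = np.arange(1, 11)
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])

# Sum the entries of `a` that share the same index in `accmap`.
sums = np.bincount(accmap, weights=a)
print(sums)  # [13. 25. 17.]

# Optional: pad the output so that indices 3 and 4, which never occur, get 0.
padded = np.bincount(accmap, weights=a, minlength=5)
```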

Answer 2

Late to the party, but...

As @Jamie says, for the case of summing, np.bincount is fast and simple. In the more general case, for other ufuncs such as maximum, you can use the np.ufunc.at method.
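As a small illustration of ufunc.at on the question's data (a sketch, not the code from the gist): the output array has to be pre-filled with a suitable starting value, and the choice of the smallest representable integer for the maximum case is an assumption made here.

```python
import numpy as np

a = np.arange(1, 11)
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
n_groups = accmap.max() + 1

# Group-wise maximum: start every slot at the smallest representable value,
# then let maximum.at fold each element of `a` into its group's slot.
group_max = np.full(n_groups, np.iinfo(a.dtype).min, dtype=a.dtype)
np.maximum.at(group_max, accmap, a)

# The same pattern with add.at reproduces the sum from the question.
group_sum = np.zeros(n_groups, dtype=a.dtype)
np.add.at(group_sum, accmap, a)
```

Unlike plain fancy-index assignment, ufunc.at applies the operation once per occurrence of a repeated index, which is what makes it usable for accumulation.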

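I put together ~~a gist~~ [see the link below] that wraps this in a Matlab-like interface. It also exploits the repeated-indexing rules to provide 'last' and 'first' functions and, unlike Matlab, 'mean' is sensibly optimized (calling accumarray with @mean in Matlab is really slow because it runs a non-builtin function for every single group, which is stupid).

Be warned that I haven't particularly tested the gist, but I hope to update it in the future with extra features and bug fixes.

Update May/June 2015: I have reworked my implementation; it is now available as part of ml31415/numpy-groupies and can be installed from PyPI (pip install numpy-groupies). Benchmarks are as follows (see the github repo for up-to-date values)...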

function  pure-py  np-grouploop   np-ufuncat np-optimised    pandas        ratio
     std  1737.8ms       171.8ms     no-impl       7.0ms    no-impl   247.1: 24.4:  -  : 1.0 :  -  
     all  1280.8ms        62.2ms      41.8ms       6.6ms    550.7ms   193.5: 9.4 : 6.3 : 1.0 : 83.2
     min  1358.7ms        59.6ms      42.6ms      42.7ms     24.5ms    55.4: 2.4 : 1.7 : 1.7 : 1.0 
     max  1538.3ms        55.9ms      38.8ms      37.5ms     18.8ms    81.9: 3.0 : 2.1 : 2.0 : 1.0 
     sum  1532.8ms        62.6ms      40.6ms       1.9ms     20.4ms   808.5: 33.0: 21.4: 1.0 : 10.7
     var  1756.8ms       146.2ms     no-impl       6.3ms    no-impl   279.1: 23.2:  -  : 1.0 :  -  
    prod  1448.8ms        55.2ms      39.9ms      38.7ms     20.2ms    71.7: 2.7 : 2.0 : 1.9 : 1.0 
     any  1399.5ms        69.1ms      41.1ms       5.7ms    558.8ms   246.2: 12.2: 7.2 : 1.0 : 98.3
    mean  1321.3ms        88.3ms     no-impl       4.0ms     20.9ms   327.6: 21.9:  -  : 1.0 : 5.2 
Python 2.7.9, Numpy 1.9.2, Win7 Core i7.

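Here we use 100,000 indices picked uniformly from [0, 1000). Specifically, about 25% of the values are 0 (for use with bool operations), and the remainder are uniformly distributed on [-50,25). Timings are shown for 10 repeats.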

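pure-py - uses nothing but pure python, relying partly on itertools.groupby.
np-grouploop - uses numpy to sort the values by idx, then uses split to create separate arrays, and then loops over these arrays, running the relevant numpy function on each one.
np-ufuncat - uses the numpy ufunc.at method, which is slower than it ought to be, as discussed in an issue I created on numpy's github repo.
np-optimised - uses custom numpy indexing and other tricks to beat the above two implementations (except for min, max and prod, which rely on ufunc.at).
pandas - pd.DataFrame({'idx':idx, 'vals':vals}).groupby('idx').sum() etc.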
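The np-grouploop strategy described above can be sketched roughly as follows. This is an illustration of the sort/split/loop idea only, not the code from numpy-groupies; the function name group_loop is hypothetical, and the sketch assumes a non-empty 1-D idx.

```python
import numpy as np

def group_loop(idx, vals, func):
    """Apply `func` to each group of `vals` sharing the same value in `idx`."""
    # Sort the values so that members of the same group are contiguous.
    order = np.argsort(idx, kind="stable")
    sorted_idx = idx[order]
    sorted_vals = vals[order]
    # Positions where the group label changes mark the split points.
    boundaries = np.flatnonzero(np.diff(sorted_idx)) + 1
    groups = np.split(sorted_vals, boundaries)
    # One label per group: the label found at the start of each chunk.
    labels = sorted_idx[np.r_[0, boundaries]]
    # Python-level loop over groups; this is the slow part for many groups.
    return labels, np.array([func(g) for g in groups])

a = np.arange(1, 11)
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
labels, sums = group_loop(accmap, a, np.sum)
_, maxima = group_loop(accmap, a, np.max)
```

The approach works with any reduction function, which is why it can cover std and var in the table, but the per-group Python loop makes it slower than the vectorised variants.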

Note that some of the no-impl entries may be unwarranted, but I haven't bothered to get them working yet.

As explained on github, accumarray now supports nan-prefixed functions (e.g. nansum) as well as functions such as sort and rsort. It also works with multidimensional indexing.

Answer 3

I wrote an accumarray implementation with scipy.weave and uploaded it to github: https://github.com/ml31415/numpy-groupies

Answer 4

Not as good as the accepted answer, but:

[np.sum([a[x] for x in y]) for y in [list(np.where(accmap==z)) for z in np.unique(accmap).tolist()]]

This takes 108us per loop (100000 loops, best of 3).

The accepted answer (np.bincount(accmap, weights=a)) takes 2.05us per loop (100000 loops, best of 3).

Answer 5

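How about the following:

import numpy

def accumarray(a, accmap):
    # Sort so that elements sharing an index become contiguous.
    ordered_indices = numpy.argsort(accmap)
    ordered_accmap = accmap[ordered_indices]

    # Position of the first element of each group in the sorted order.
    _, sum_indices = numpy.unique(ordered_accmap, return_index=True)

    # Running total just before each group starts. For the first group the
    # index -1 wraps around, so entry 0 holds the grand total.
    cumulative_sum = numpy.cumsum(a[ordered_indices])[sum_indices-1]

    # Running total at the end of each group (the last one is the grand total).
    result = numpy.empty(len(sum_indices), dtype=a.dtype)
    result[:-1] = cumulative_sum[1:]
    result[-1] = cumulative_sum[0]

    # Subtract the total at the end of the previous group to get per-group sums.
    result[1:] = result[1:] - cumulative_sum[1:]

    return result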
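The same sort-then-segment idea can be written more compactly with np.add.reduceat. This is a swapped-in variant rather than the answer's own code, and the name accumarray_reduceat is hypothetical. Like the function above, it returns one entry per index value that actually occurs, so indices missing from accmap do not produce zeros.

```python
import numpy as np

def accumarray_reduceat(a, accmap):
    # Sort so that elements sharing an index are contiguous.
    order = np.argsort(accmap, kind="stable")
    # Start position of each group within the sorted arrays.
    _, starts = np.unique(accmap[order], return_index=True)
    # Sum each segment a[starts[i]:starts[i+1]]; the last runs to the end.
    return np.add.reduceat(a[order], starts)

a = np.arange(1, 11)
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
result = accumarray_reduceat(a, accmap)
```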

Answer 6

You can do this in one line with a pandas DataFrame.

In [159]: df = pd.DataFrame({"y":np.arange(1,11),"x":[0,1,0,0,0,1,1,2,2,1]})

In [160]: df
Out[160]: 
   x   y
0  0   1
1  1   2
2  0   3
3  0   4
4  0   5
5  1   6
6  1   7
7  2   8
8  2   9
9  1  10

In [161]: pd.pivot_table(df,values='y',index='x',aggfunc=sum)
Out[161]: 
    y
x    
0  13
1  25
2  17

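You can tell pivot_table to use specific columns as the index and the values, and get back a new DataFrame object. When you specify sum as the aggregation function, the result is identical to Matlab's accumarray.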

Answer 7

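It depends on what you are trying to do, but numpy unique has a bunch of optional outputs that you can use to accumulate. If your array contains several identical values, unique will count how many there are of each when you set the return_counts option to True. In some simple applications, this is all you need.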

numpy.unique(ar, return_index=False, return_inverse=False, return_counts=True, axis=None)

You can also set return_index to True and use the result to accumulate another array.
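As a hedged illustration of using unique's optional outputs for accumulation: the sketch below uses return_inverse, which is a choice made here rather than the option named in the answer, because it maps every element to a dense group number that np.bincount can consume. The labels and values arrays are made-up example data.

```python
import numpy as np

# Made-up example: arbitrary (non-integer) labels and values to accumulate.
labels = np.array(["b", "a", "b", "c", "a", "b"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# `uniq` holds the sorted distinct labels, `inverse` gives each element's
# position in `uniq`, and `counts` says how often each label occurs.
uniq, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)

# Accumulate `values` per label using the dense group numbers.
sums = np.bincount(inverse, weights=values)
```

Because unique handles the relabelling, this also works when the group keys are strings or sparse integers rather than the 0..n-1 indices that bincount expects.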

Is there a MATLAB accumarray equivalent in numpy?

7 answers: