Question

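Let's say I have a numpy array like [1,2,3,4,5,6] and another array [0,0,1,2,2,1]. I want to sum the items in the first array by group (the second array) and get the n group results in group-number order (in this case the result would be [3, 9, 9]). How do I do this in numpy?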

Answer 1

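Here is a vectorized way of doing this sum, based on the implementation of numpy.unique. According to my timings, it is 500 times faster than the loop method and 100 times faster than the histogram method.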

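import numpy as np

def sum_by_group(values, groups):
    # Accept plain lists as well as arrays.
    values, groups = np.asarray(values), np.asarray(groups)
    # Sort both arrays so equal group labels are contiguous (fancy indexing copies).
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    # Running total over the sorted values, computed in place on the copy.
    values.cumsum(out=values)
    # Mark the last element of each run of equal labels.
    index = np.ones(len(groups), dtype=bool)
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    # Differences between running totals at the run ends are the per-group sums.
    values[1:] = values[1:] - values[:-1]
    return values, groups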
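
For comparison, the same sort-then-segment idea can also be written with np.add.reduceat. This is only a sketch and not the code from the answer above: the name sum_by_group_reduceat is made up for illustration, and it assumes non-empty 1-D inputs.

```python
import numpy as np

def sum_by_group_reduceat(values, groups):
    # Hypothetical helper for illustration; assumes non-empty 1-D inputs.
    values = np.asarray(values)
    groups = np.asarray(groups)
    # Sort so that equal labels sit next to each other.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    sorted_values = values[order]
    # Index at which each run of equal labels starts.
    is_start = np.r_[True, sorted_groups[1:] != sorted_groups[:-1]]
    starts = np.flatnonzero(is_start)
    # reduceat sums each slice starts[i]:starts[i+1] (the last one runs to the end).
    return np.add.reduceat(sorted_values, starts), sorted_groups[starts]

sums, labels = sum_by_group_reduceat([1, 2, 3, 4, 5, 6], [0, 0, 1, 2, 2, 1])
print(sums, labels)
```

Like the function above, this returns the sums together with the group labels they belong to, so it also works when the labels are not 0..n-1.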

Answer 2

The numpy function bincount was made exactly for this purpose, and I'm sure it will be much faster than the other methods for all sizes of inputs:

import numpy as np

data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]

np.bincount(ids, weights=data)  # returns array([3., 9., 9.]), a float64 array

The i-th element of the output is the sum of all the data elements corresponding to "id" i.
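
Two details that may matter in practice: the result is float64 whenever weights are given, and the output length is max(ids) + 1 unless you ask for more. A small sketch, assuming you know the number of groups up front (n_groups is a name made up for this example):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
ids = np.array([0, 0, 1, 2, 2, 1])
n_groups = 5  # assumed known in advance; groups 3 and 4 have no members here

# minlength pads the result so trailing empty groups still get an entry.
sums = np.bincount(ids, weights=data, minlength=n_groups)

# With weights the result is float64; cast back if integer sums are wanted.
int_sums = sums.astype(data.dtype)
print(sums, int_sums)
```

Casting back is only safe while the sums are exactly representable as floats, which holds for small integers like these.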

Hope that helps.

Answer 3

There's more than one way to do this, but here's one:

import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)
sums = []
for group in unique_groups:
    sums.append(data[groups == group].sum())

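You can vectorize this so that there is no for loop at all, but I'd recommend against it. It becomes unreadable and needs a couple of 2D temporary arrays, which can take a lot of memory if you have a lot of data.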

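Edit: Here's one way you could vectorize it entirely. Keep in mind that this may (and likely will) be slower than the version above. (There may also be a better way to vectorize this, but it's late and I'm tired, so this is just the first thing that popped into my head...)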

However, keep in mind that this is a bad example... You're really better off (in terms of both speed and readability) with the loop above...

import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)

# Forgive the bad naming here...
# I can't think of more descriptive variable names at the moment...
x, y = np.meshgrid(groups, unique_groups)
data_stack = np.tile(data, (unique_groups.size, 1))

data_in_group = np.zeros_like(data_stack)
data_in_group[x==y] = data_stack[x==y]

sums = data_in_group.sum(axis=1)
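
The same idea can be written a bit more compactly with broadcasting instead of meshgrid and tile. This is a sketch of an equivalent formulation, not a recommendation: it still builds an (n_groups, n_items) temporary, so the memory caveat above applies unchanged.

```python
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
unique_groups = np.unique(groups)

# Row i of the mask is True wherever an item belongs to unique_groups[i].
mask = unique_groups[:, None] == groups   # shape (n_groups, n_items)

# Keep each item only in the row of its own group, then sum along the rows.
sums = np.where(mask, data, 0).sum(axis=1)
print(sums)
```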

Answer 4

If the groups are indexed by consecutive integers, you can abuse the numpy.histogram() function to get the result:

import numpy

data = numpy.arange(1, 7)
groups = numpy.array([0,0,1,2,2,1])
sums = numpy.histogram(groups,
                       bins=numpy.arange(groups.min(), groups.max()+2),
                       weights=data)[0]
# array([3, 9, 9])

This avoids any Python loops.
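
In case the `groups.max()+2` looks odd: bins are given as edges, so n groups need n+1 edges. A quick sketch that just makes the edges visible for the same data:

```python
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

# One more edge than there are groups: [0, 1, 2, 3] for the groups 0, 1, 2.
edges = np.arange(groups.min(), groups.max() + 2)

# Bins are half-open [0, 1), [1, 2), except the last one, [2, 3], which is closed.
sums, returned_edges = np.histogram(groups, bins=edges, weights=data)
print(edges, sums)
```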

Answer 5

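I tried everyone's scripts, and my conclusions are:

Joe: will only work if you have few groups.

kevpie: too slow because of the loops (this is not the Pythonic way).

Bi_Rico and Sven: perform nicely, but only work for Int32 (they fail if the sum goes over 2^32/2).

Alex: the fastest one, good for sums.

But if you want more flexibility and the option to group by other statistics, use SciPy: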

import numpy as np
from scipy import ndimage

data = np.arange(10000000)
groups = np.arange(1000).repeat(10000)
ndimage.sum(data, groups, range(1000))

This is nice because you get many statistics to group by (sum, mean, variance, ...).
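
To illustrate the "other statistics" point, here is a small sketch on the data from the question. It assumes SciPy is installed and uses only ndimage.sum, ndimage.mean and ndimage.maximum, which all take the same (input, labels, index) arguments:

```python
import numpy as np
from scipy import ndimage

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
index = np.unique(groups)  # the labels to report on, in ascending order

# Same call shape for each statistic; only the function changes.
group_sums = ndimage.sum(data, labels=groups, index=index)
group_means = ndimage.mean(data, labels=groups, index=index)
group_maxima = ndimage.maximum(data, labels=groups, index=index)
print(group_sums, group_means, group_maxima)
```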

Answer 6

You're all wrong! The best way to do it is:

import numpy as np

a = [1,2,3,4,5,6]
ix = [0,0,1,2,2,1]
accum = np.zeros(np.max(ix)+1)
np.add.at(accum, ix, a)
print(accum)
> [3. 9. 9.]
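
It may be worth spelling out why np.add.at is needed here rather than a plain `accum[ix] += a`. A minimal sketch of the difference (the variable names buffered and unbuffered are just for this example):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
ix = np.array([0, 0, 1, 2, 2, 1])

# Fancy-indexed += is buffered: each repeated index is written once,
# so contributions from duplicate indices are lost.
buffered = np.zeros(ix.max() + 1)
buffered[ix] += a

# np.add.at is unbuffered: every occurrence of an index is accumulated.
unbuffered = np.zeros(ix.max() + 1)
np.add.at(unbuffered, ix, a)

print(buffered, unbuffered)
```

Only the np.add.at version gives the per-group sums.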

Answer 7

A pure Python implementation:

l = [1,2,3,4,5,6]
g = [0,0,1,2,2,1]

from collections import defaultdict

def group_sum(l, g):
    # Accumulate each value into the running total for its group.
    groups = defaultdict(int)
    for li, gi in zip(l, g):
        groups[gi] += li
    # Return the totals ordered by group key.
    return [total for _, total in sorted(groups.items())]

print(group_sum(l, g))

[3, 9, 9]

Answer 8

I noticed the numpy tag, but if you don't mind using pandas, this task becomes a one-liner:

import pandas as pd
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

df = pd.DataFrame({'data': data, 'groups': groups})

so df then looks like this:

   data  groups
0     1       0
1     2       0
2     3       1
3     4       2
4     5       2
5     6       1

Now you can use the functions groupby() and sum():

print(df.groupby(['groups'], sort=False).sum())

which gives you the desired output:

        data
groups      
0          3
1          9
2          9

By default, groupby sorts the result by group key, so I pass the flag sort=False, which may improve speed for huge dataframes.
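
One thing to watch with sort=False: the groups then come back in order of first appearance, which only matches group-number order when the labels happen to show up in ascending order, as they do above. A sketch with reordered labels (assuming pandas is installed; the data is made up for this example):

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([2, 2, 1, 0, 0, 1])  # labels do not first appear in ascending order

series = pd.Series(data)
sums_sorted = series.groupby(groups, sort=True).sum()     # keys in ascending order
sums_unsorted = series.groupby(groups, sort=False).sum()  # keys in order of appearance

print(sums_sorted.index.tolist(), sums_sorted.to_numpy())
print(sums_unsorted.index.tolist(), sums_unsorted.to_numpy())
```

So if you need the result in group-number order, as the question asks, keep the default sort=True or sort the result afterwards.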

Answer 9

Also, a note on Alex's answer:

data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]
np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array

If your indices are not consecutive, you may end up wondering why you keep getting a lot of zeros.

For example:

data = [1,2,3,4,5,6]
ids  = [1,1,3,5,5,3]
np.bincount(ids, weights=data)

will give you:

array([0., 3., 0., 9., 0., 9.])

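This is because bincount builds a bin for every integer from 0 up to the max id in the list, and then returns the sum for each bin, including the empty ones.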
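
If you only want sums for the ids that actually occur, one common workaround is to compress the ids to 0..n-1 with np.unique first. A minimal sketch (the names labels and inverse are just for this example):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
ids = np.array([1, 1, 3, 5, 5, 3])

# labels holds the distinct ids in ascending order;
# inverse maps every element of ids to its position in labels.
labels, inverse = np.unique(ids, return_inverse=True)

# Summing over the compressed ids leaves no empty bins.
sums = np.bincount(inverse, weights=data)
print(labels, sums)
```

Here sums[i] belongs to labels[i], so the pair plays the same role as the (values, groups) return of the first answer.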

Answer 10

I tried different ways of doing this and found that np.bincount is indeed the fastest. See Alex's answer.

    import numpy as np
    import random
    import time
    
    size = 10000
    ngroups = 10
    
    groups = np.random.randint(low=0,high=ngroups,size=size)
    values = np.random.rand(size)
    
    
    # Test 1
    beg = time.time()
    result = np.zeros(ngroups)
    for i in range(size):
        result[groups[i]] += values[i]
    print('Test 1 took:',time.time()-beg)
    
    # Test 2
    beg = time.time()
    result = np.zeros(ngroups)
    for g,v in zip(groups,values):
        result[g] += v
    print('Test 2 took:',time.time()-beg)
    
    # Test 3
    beg = time.time()
    result = np.zeros(ngroups)
    for g in np.unique(groups):
        wh = np.where(groups == g)
        result[g] = np.sum(values[wh[0]])
    print('Test 3 took:',time.time()-beg)
    
    
    # Test 4
    beg = time.time()
    result = np.zeros(ngroups)
    for g in np.unique(groups):
        wh = groups == g
        result[g] = np.sum(values, where = wh)
    print('Test 4 took:',time.time()-beg)
    
    # Test 5
    beg = time.time()
    result = np.array([np.sum(values[np.where(groups == g)[0]]) for g in np.unique(groups) ])
    print('Test 5 took:',time.time()-beg)
    
    # Test 6
    beg = time.time()
    result = np.array([np.sum(values, where = groups == g) for g in np.unique(groups) ])
    print('Test 6 took:',time.time()-beg)
    
    # Test 7
    beg = time.time()
    result = np.bincount(groups, weights = values)
    print('Test 7 took:',time.time()-beg)

Results:

    Test 1 took: 0.005615234375
    Test 2 took: 0.004812002182006836
    Test 3 took: 0.0006084442138671875
    Test 4 took: 0.0005099773406982422
    Test 5 took: 0.000499725341796875
    Test 6 took: 0.0004980564117431641
    Test 7 took: 1.9073486328125e-05
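
A single time.time() measurement of a call that takes microseconds is fairly noisy, so here is a sketch of how the two extremes could be compared with timeit instead. The helper names loop_sum and bincount_sum are made up for this example, and the numbers will differ from machine to machine:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
size, ngroups = 10000, 10
groups = rng.integers(0, ngroups, size=size)
values = rng.random(size)

def loop_sum():
    # Same approach as Test 2: a plain Python loop over (group, value) pairs.
    result = np.zeros(ngroups)
    for g, v in zip(groups, values):
        result[g] += v
    return result

def bincount_sum():
    # Same approach as Test 7, padded so every group gets an entry.
    return np.bincount(groups, weights=values, minlength=ngroups)

# Best of 3 repeats, 5 calls each, reported as seconds per call.
loop_best = min(timeit.repeat(loop_sum, number=5, repeat=3)) / 5
bincount_best = min(timeit.repeat(bincount_sum, number=5, repeat=3)) / 5
print('loop:', loop_best, 'bincount:', bincount_best)
```

Taking the minimum over several repeats is the usual way to reduce the influence of other activity on the machine.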

Sum array by number in numpy
