Question

Here is my data:

a = np.array([[1,2],[2,1],[7,1],[3,2]])

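I want a sum for each distinct number in the second column. In this example, the second column has two possible values: 1 and 2.

I want to sum all values in the first column that share the same value in the second column. Is there a built-in numpy function for this?

For example, the sum for each 1 in the second column would be: 2 + 7 = 9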

Answer 1

A short but slightly tricky way is to use the numpy function bincount:

np.bincount(a[:,1], weights=a[:,0])

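It counts the number of occurrences of 0, 1, 2, and so on in the array (here, a[:,1] is the list of your category numbers). The weights argument makes each occurrence contribute its weight instead of 1; here the weight is the corresponding value from the first column, so the result is a sum per category.

It returns: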

array([ 0.,  9.,  4.])

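where the first 0 is the sum of the first elements whose second element is 0, and so on. So this only works if the second numbers you group by are non-negative integers.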
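To make the role of weights concrete, here is a small sketch comparing the plain counts with the weighted result on the question's array (the variable names counts and sums are only illustrative):

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])

# Without weights: how many times each category 0, 1, 2 occurs.
counts = np.bincount(a[:, 1])

# With weights: each occurrence adds its first-column value instead of 1.
sums = np.bincount(a[:, 1], weights=a[:, 0])

print(counts)  # [0 2 2]
print(sums)    # [0. 9. 4.]
```

The weighted result is always a float array, even when the inputs are integers.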

If they are not consecutive integers starting from 0, you can select the entries you need like this:

np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]

This returns:

array([9.,  4.])

This is an array of sums, ordered by the second element (because unique returns a sorted array).
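As an illustration with labels that do not start at 0, here is a sketch using a made-up array b (not from the question) whose second column contains only 3 and 10:

```python
import numpy as np

# Hypothetical data: same first column as the question, labels 3 and 10.
b = np.array([[1, 10], [2, 3], [7, 3], [3, 10]])

full = np.bincount(b[:, 1], weights=b[:, 0])  # one slot per integer 0..10
labels = np.unique(b[:, 1])                   # sorted distinct labels
sums = full[labels]                           # keep only the slots in use

print(labels)  # [ 3 10]
print(sums)    # [9. 4.]
```

Note that the intermediate array has max(label) + 1 entries, so this can be wasteful when the labels are large and sparse.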

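If your second elements are not integers, then first of all you may run into trouble with floating point arithmetic (elements you consider equal may actually differ). But if you are sure that is not a problem, you can sort them and assign integers to them (for example, using scipy's rankdata function, referred to as rd below):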

from scipy.stats import rankdata as rd
ind = rd(a[:,1], method='dense').astype(int) - 1  # ranks start at 1, bincount needs indices from 0
sums = np.bincount(ind, weights=a[:,0])

This returns array([9., 4.]), ordered by your second element. You can zip the sums to pair each one with its element:

list(zip(np.unique(a[:,1]), sums))  # list() is needed in Python 3, where zip is lazy
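If SciPy is not available, a similar dense index can be obtained from np.unique with return_inverse=True. This is an alternative to the rank-based approach, not the answer's own method, and it still relies on the float keys being exactly equal. The array b is made up for illustration:

```python
import numpy as np

# Hypothetical data with non-integer group keys in the second column.
b = np.array([[1, 0.5], [2, 0.25], [7, 0.25], [3, 0.5]])

# keys: sorted distinct values; inverse: position of each row's key in keys.
keys, inverse = np.unique(b[:, 1], return_inverse=True)
sums = np.bincount(inverse, weights=b[:, 0])

pairs = list(zip(keys.tolist(), sums.tolist()))
print(pairs)  # [(0.25, 9.0), (0.5, 4.0)]
```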

Answer 2

Contents of play.py:

import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

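def compute_sum3(a):
    # Indexes all groups at once, so it requires every group to have
    # the same number of rows (otherwise same_idxs is ragged).
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum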

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)


if __name__ == '__main__':
    main()

Output:

In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9.  4.]
compute_sum2
[9.0, 4.0]

In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop

In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop

In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop

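Note that compute_sum2() shows the lowest time, but the comparison is not like for like: it returns a generator, so %timeit measures only the np.unique call and the creation of the generator, while the sums themselves are computed later, when the generator is consumed. If your matrix is huge, I would still suggest this implementation, because it uses generator expressions instead of list comprehensions, which is more memory efficient. Similarly, same_sum in compute_sum1() can be turned into a generator expression by replacing [] with ().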
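A generator like the one returned by compute_sum2() does its work only when iterated, and it can be iterated only once. A minimal sketch of that behavior, using a hypothetical helper lazy_group_sums that is not part of play.py:

```python
import numpy as np

def lazy_group_sums(a):
    """Hypothetical helper: yield one sum per distinct value in column 1."""
    unique = np.unique(a[:, 1])
    # Boolean-mask selection; nothing here runs until the generator is iterated.
    return (a[a[:, 1] == u, 0].sum() for u in unique)

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]]).astype("float")

gen = lazy_group_sums(a)  # cheap: no sums computed yet
result = list(gen)        # the sums are computed here
leftover = list(gen)      # the generator is now exhausted

print(result)    # two sums: 9.0 for key 1, then 4.0 for key 2
print(leftover)  # []
```

When timing such a function, wrapping the call in list(...) gives a figure that includes the actual summation.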

Answer 3

You may want to take a look at this library: https://github.com/ml31415/accumarray. It is a clone of MATLAB's accumarray for numpy.

a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])

The leading 0 means that there are no entries with index 0 in the index column.
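For comparison, the same accumulation can be sketched with plain NumPy via np.add.at. This is a swapped-in technique shown only to illustrate the result; it is not how the linked library is implemented:

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
idx, vals = a[:, 1], a[:, 0]

# One output slot per index 0..max(idx); slots never referenced stay 0.
out = np.zeros(idx.max() + 1, dtype=vals.dtype)

# np.add.at is unbuffered, so repeated indices accumulate correctly
# (plain out[idx] += vals would apply only one update per index).
np.add.at(out, idx, vals)

print(out)  # [0 9 4]
```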

Answer 4

The most straightforward way I can see is a list comprehension:

s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]

However, if the second number in each pair represents some kind of category, I would suggest converting your data to a dictionary.
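One possible reading of the dictionary suggestion, sketched with collections.defaultdict (the name totals is only illustrative, and this single pass avoids rescanning the array once per category):

```python
from collections import defaultdict

import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])

# Map each category (second column) to the running sum of the first column.
totals = defaultdict(int)
for value, category in a.tolist():
    totals[category] += value

print(dict(totals))  # {2: 4, 1: 9}
```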

Answer 5

As far as I know, numpy cannot do this directly, but it is easy to do with pandas.DataFrame.groupby.

In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]: 
1
1    9
2    4
Name: 0, dtype: int64

Of course, you can do the same thing with itertools.groupby:

In [1]: import numpy as np
   ...: from itertools import groupby
   ...: from operator import itemgetter
   ...: 

In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])

In [4]: sa = sorted(a.tolist(), key=itemgetter(1))

In [5]: grouper = groupby(sa, key=itemgetter(1))

In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}

In [7]: sums
Out[7]: {1: 9, 2: 4}
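The sorting step matters because itertools.groupby only groups consecutive rows with equal keys. A short sketch of what happens with and without it, using plain lists with the same values as the question:

```python
from itertools import groupby
from operator import itemgetter

rows = [[1, 2], [2, 1], [7, 1], [3, 2]]
key = itemgetter(1)

# Unsorted: key 2 forms two separate runs, and the later run
# overwrites the earlier one in the dict.
unsorted_sums = {k: sum(r[0] for r in g) for k, g in groupby(rows, key=key)}

# Sorted by the grouping key first: each key forms exactly one run.
sorted_sums = {k: sum(r[0] for r in g)
               for k, g in groupby(sorted(rows, key=key), key=key)}

print(unsorted_sums)  # {2: 3, 1: 9}  (wrong total for key 2)
print(sorted_sums)    # {1: 9, 2: 4}
```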

Manipulating data in a Python NumPy array: summing adjacent values using the values in one column
