I have a (large) array of data and a (large) list of lists of indices, e.g.,
data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]
For each entry of contribs, I'd like to sum up the corresponding values of data and put them into an array. For the above example, the expected result would be
out = [110.0, 1.0, 11.0]
Doing this in a loop works,
import numpy

c = numpy.zeros(len(contribs))
for k, indices in enumerate(contribs):
    for idx in indices:
        c[k] += data[idx]
but since data and contribs are large, this takes too long.
I have the feeling this could be improved with numpy's fancy indexing.
Any hints?
Answer 0 (score: 5)
One possibility would be
import numpy as np

data = np.array(data)
out = [np.sum(data[c]) for c in contribs]
This should be faster than the double loop, at least.
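For the sample input from the question, a minimal check of this approach (a sketch; the final comment states the expected sums):

import numpy as np

data = np.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# Fancy indexing data[c] gathers the contributing values; np.sum reduces each group.
out = [np.sum(data[c]) for c in contribs]
# out holds the three group sums: 110.0, 1.0, 11.0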
Answer 1 (score: 2)
Here's an almost-vectorized* approach -
import numpy as np

# Get the lengths of the list elements in contribs and their cumulative sums,
# to be used for creating an ID array later on.
clens = np.cumsum([len(item) for item in contribs])

# Set up an ID array that assigns the same ID to entries belonging to the same
# list element in contribs. These IDs are used to accumulate values from a
# corresponding array created by indexing into data with a flattened contribs.
id_arr = np.zeros(clens[-1], dtype=int)
id_arr[clens[:-1]] = 1
out = np.bincount(id_arr.cumsum(), np.take(data, np.concatenate(contribs)))
This approach involves some setup work, so the benefits would show up when it is fed input arrays of decent size and a fair number of list elements in contribs, which correspond to the iterations of the other loopy solutions.

*Note that this is coined almost vectorized, because the only loop performed here is at the start, where we get the lengths of the list elements. But since that part is not computationally demanding, it should have minimal effect on the total runtime.
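To make the steps concrete, here is the question's sample input traced through the intermediate values (a sketch; ids and vals are names introduced here just for the trace, and the inline comments show the resulting arrays):

import numpy as np

data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]

clens = np.cumsum([len(item) for item in contribs])  # [2 3 5]
id_arr = np.zeros(clens[-1], dtype=int)              # [0 0 0 0 0]
id_arr[clens[:-1]] = 1                               # [0 0 1 1 0]
ids = id_arr.cumsum()                                # [0 0 1 2 2] - group ID per entry
vals = np.take(data, np.concatenate(contribs))       # [ 10. 100.   1.   1.  10.]
out = np.bincount(ids, vals)                         # [110.   1.  11.]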
Answer 2 (score: 0)
I'm not sure this works in all cases, but for your example, with data as a numpy.array:
# Flatten "contribs"
f = [j for i in contribs for j in i]
# Get the "ranges" of data[f] that will be summed in the next step
i = [0] + numpy.cumsum([len(i) for i in contribs]).tolist()[:-1]
# Take the required sums
numpy.add.reduceat(data[f], i)
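A self-contained run on the question's sample data (a sketch; the expected output appears in the final comment). One caveat worth noting: an empty list in contribs would produce a repeated start index, for which numpy.add.reduceat returns the single element at that position rather than 0.

import numpy

data = numpy.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# Flatten contribs: [1, 2, 0, 0, 1]
f = [j for c in contribs for j in c]
# Start index of each group within data[f]: [0, 2, 3]
starts = [0] + numpy.cumsum([len(c) for c in contribs]).tolist()[:-1]
# Sum each range data[f][starts[k]:starts[k+1]]: [110.   1.  11.]
print(numpy.add.reduceat(data[f], starts))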