NumPy: select data and sum it into an array

Time: 2016-06-29 18:34:07

Tags: python numpy

I have a (large) data array and a (large) list of (a few) index lists, e.g.,

data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]

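For each entry in contribs, I'd like to sum up the corresponding values of data and put the sums into an array. For the example above, the expected result would be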

out = [110.0, 1.0, 11.0]

Doing this in a loop works,

import numpy

c = numpy.zeros(len(contribs))
for k, indices in enumerate(contribs):
    for idx in indices:
        c[k] += data[idx]

but since data and contribs are large, it takes too long.

I have the feeling this can be improved using NumPy's fancy indexing.

Any hints?

3 answers:

Answer 0: (score: 5)

One possibility would be

import numpy as np

data = np.array(data)
out = [np.sum(data[c]) for c in contribs]

This should at least be faster than the double loop.

Answer 1: (score: 2)

Here's an almost vectorized* approach:
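For reference, here is a self-contained sketch of this approach applied to the example data from the question. Wrapping the result in np.array is my own assumption, made so that out ends up as an array rather than a plain list.

```python
import numpy as np

# Example data from the question
data = np.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# One fancy-indexing lookup and one sum per entry of contribs;
# only the outer loop remains in Python.
out = np.array([data[c].sum() for c in contribs])

print(out.tolist())  # expected result from the question: [110.0, 1.0, 11.0]
```

This still runs one Python-level iteration per entry of contribs, so the gain over the double loop should be largest when the individual index lists are long.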

# Get lengths of list element in contribs and the cumulative lengths
# to be used for creating an ID array later on.
clens = np.cumsum([len(item) for item in contribs])

# Setup ID array that corresponds to same ID for same list element in contribs.
# These IDs would be used to accumulate values from a corresponding array
# that is created by indexing into data array with a flattened contribs
id_arr = np.zeros(clens[-1],dtype=int)
id_arr[clens[:-1]] = 1
out = np.bincount(id_arr.cumsum(),np.take(data,np.concatenate(contribs)))

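This approach involves some setup work. So the benefits should show up with decently sized input arrays and a decent number of list elements in contribs, since that number corresponds to the number of iterations in the otherwise loopy solutions.

*Note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.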
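As a variation on the same idea, the ID array can probably also be built with np.repeat instead of the zeros/cumsum trick. The sketch below is not the code from this answer: the function name sum_by_contribs is made up for illustration, and the use of minlength plus itertools.chain is an assumption added so that empty index lists yield 0.0 instead of breaking the indexing.

```python
import itertools

import numpy as np


def sum_by_contribs(data, contribs):
    """Sum data[idx] over each index list in contribs (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    # Length of every index list; this is the only Python-level loop
    # besides the flattening below.
    lens = np.array([len(item) for item in contribs], dtype=np.intp)
    # Group ID for every flattened index, e.g. [0, 0, 1, 2, 2]
    ids = np.repeat(np.arange(len(contribs)), lens)
    # Flattened indices into data, e.g. [1, 2, 0, 0, 1]
    flat = np.fromiter(itertools.chain.from_iterable(contribs), dtype=np.intp)
    # Weighted bincount accumulates data[flat] per group ID;
    # minlength keeps trailing empty groups in the output.
    return np.bincount(ids, weights=data[flat], minlength=len(contribs))


out = sum_by_contribs([1.0, 10.0, 100.0], [[1, 2], [0], [0, 1]])
print(out.tolist())  # expected result from the question: [110.0, 1.0, 11.0]
```

Whether this is faster than the cumsum-based ID array is something you would have to time on your own data.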

Answer 2: (score: 0)

I'm not sure this works in all cases, but for your example, with data being a numpy.array:

# Flatten "contribs"
f = [j for i in contribs for j in i]

# Get the "ranges" of data[f] that will be summed in the next step
i = [0] + numpy.cumsum([len(c) for c in contribs]).tolist()[:-1]

# Take the required sums
numpy.add.reduceat(data[f], i)
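To make the steps above concrete, here is a self-contained sketch run on the example from the question. The names flat, lens and starts are mine, not from the answer. One caveat on the "all cases" doubt: reduceat does not return 0 for an empty range. When a start index is not smaller than the next one, it returns the single element at that start index, so this sketch assumes every list in contribs is non-empty.

```python
import numpy as np

# Example data from the question
data = np.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# Flatten contribs: [1, 2, 0, 0, 1]
flat = [j for c in contribs for j in c]

# Start offset of each group inside the flattened list: [0, 2, 3]
lens = [len(c) for c in contribs]
starts = np.cumsum([0] + lens[:-1])

# Sum data[flat] over the slices [0:2], [2:3] and [3:]
# (assumes no empty lists in contribs, see the caveat above)
out = np.add.reduceat(data[flat], starts)

print(out.tolist())  # expected result from the question: [110.0, 1.0, 11.0]
```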