I have a (large) array of data and a (large) list of lists of indices, e.g.,
data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]
For each entry of contribs, I'd like to sum up the corresponding values of data and put them into an array. For the above example, the expected result would be
out = [110.0, 1.0, 11.0]
Doing this in a loop works,
import numpy

c = numpy.zeros(len(contribs))
for k, indices in enumerate(contribs):
    for idx in indices:
        c[k] += data[idx]
but since data and contribs are large, this takes too long.
I have the feeling this could be improved with numpy's fancy indexing.
Any hints?
Answer 0 (score: 5)
One possibility would be
import numpy as np

data = np.array(data)
out = [np.sum(data[c]) for c in contribs]
This should be faster than the double loop, at least.
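For the sample input from the question, a minimal check of this approach (a sketch; the final comment states the expected sums):

import numpy as np

data = np.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# Fancy indexing data[c] gathers the contributing values; np.sum reduces each group.
out = [np.sum(data[c]) for c in contribs]
# out holds the three group sums: 110.0, 1.0, 11.0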
Answer 1 (score: 2)
Here's an almost-vectorized* approach -
import numpy as np

# Get the lengths of the list elements in contribs and their cumulative sums,
# to be used for creating an ID array later on.
clens = np.cumsum([len(item) for item in contribs])

# Set up an ID array that assigns the same ID to entries belonging to the same
# list element in contribs. These IDs are used to accumulate values from a
# corresponding array created by indexing into data with a flattened contribs.
id_arr = np.zeros(clens[-1], dtype=int)
id_arr[clens[:-1]] = 1
out = np.bincount(id_arr.cumsum(), np.take(data, np.concatenate(contribs)))
This approach involves some setup work, so the benefits would show up when it is fed input arrays of decent size and a fair number of list elements in contribs, which correspond to the iterations of the other loopy solutions.

*Note that this is coined almost vectorized, because the only loop performed here is at the start, where we get the lengths of the list elements. But since that part is not computationally demanding, it should have minimal effect on the total runtime.
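To make the steps concrete, here is the question's sample input traced through the intermediate values (a sketch; ids and vals are names introduced here just for the trace, and the inline comments show the resulting arrays):

import numpy as np

data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]

clens = np.cumsum([len(item) for item in contribs])  # [2 3 5]
id_arr = np.zeros(clens[-1], dtype=int)              # [0 0 0 0 0]
id_arr[clens[:-1]] = 1                               # [0 0 1 1 0]
ids = id_arr.cumsum()                                # [0 0 1 2 2] - group ID per entry
vals = np.take(data, np.concatenate(contribs))       # [ 10. 100.   1.   1.  10.]
out = np.bincount(ids, vals)                         # [110.   1.  11.]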
Answer 2 (score: 0)
I'm not sure this works in all cases, but for your example, with data as a numpy.array:
# Flatten "contribs"
f = [j for i in contribs for j in i]
# Get the "ranges" of data[f] that will be summed in the next step
i = [0] + numpy.cumsum([len(i) for i in contribs]).tolist()[:-1]
# Take the required sums
numpy.add.reduceat(data[f], i)
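A self-contained run on the question's sample data (a sketch; the expected output appears in the final comment). One caveat worth noting: an empty list in contribs would produce a repeated start index, for which numpy.add.reduceat returns the single element at that position rather than 0.

import numpy

data = numpy.array([1.0, 10.0, 100.0])
contribs = [[1, 2], [0], [0, 1]]

# Flatten contribs: [1, 2, 0, 0, 1]
f = [j for c in contribs for j in c]
# Start index of each group within data[f]: [0, 2, 3]
starts = [0] + numpy.cumsum([len(c) for c in contribs]).tolist()[:-1]
# Sum each range data[f][starts[k]:starts[k+1]]: [110.   1.  11.]
print(numpy.add.reduceat(data[f], starts))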