Question

I can't find a solution to a performance problem.

I have a 1D array, and I want to compute sums over sliding windows defined by index pairs. Here is some example code:

import numpy as np
input = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]] #just an example
output = np.array([input[idx[0]: idx[1]].sum() for idx in list_of_indices])

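Compared with numpy's vectorized built-in functions, computing the output array this way is extremely slow. In real life, my list_of_indices contains tens of thousands of [lower bound, upper bound] pairs, and this loop is definitely the bottleneck of a high-performance Python script.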

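How can I handle this using numpy's internal functions, such as masks, a clever np.einsum, or something similar? Since I work in the HPC field, I am also concerned about memory consumption.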

Does anyone have a solution to this problem that respects the performance requirements?

Answer 1

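If:

input is about the same length as output, or shorter
the output values have similar magnitudes

...you can compute the cumsum of the input values. Each window sum then becomes a subtraction.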

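# Prepend a zero so that cs[k] == input[:k].sum()
cs = np.zeros(len(input) + 1, dtype=np.float32)  # or float64 if you need it
cs[1:] = np.cumsum(input, dtype=np.float32)
loi = np.array(list_of_indices, dtype=np.uint16)
output = cs[loi[:,1]] - cs[loi[:,0]]  # equals input[lo:hi].sum() per pair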

The numerical hazard here is loss of precision if input has runs of large and small values. In that case the cumsum may not be accurate enough.
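A minimal sketch of the cumulative-sum idea, assuming a zero is prepended so that cs[hi] - cs[lo] equals input[lo:hi].sum(). The helper name window_sums and the test data are hypothetical, chosen only for illustration. It checks the result against the original loop and then compares the float32 and float64 errors on a long array, which is where the precision concern tends to show up.

```python
import numpy as np

def window_sums(values, bounds, dtype=np.float64):
    """Sum values[lo:hi] for every (lo, hi) pair with a single cumulative sum."""
    values = np.asarray(values)
    # cs[k] holds values[:k].sum(); cs[0] is the empty sum.
    cs = np.empty(len(values) + 1, dtype=dtype)
    cs[0] = 0
    np.cumsum(values, dtype=dtype, out=cs[1:])
    bounds = np.asarray(bounds, dtype=np.intp)
    return cs[bounds[:, 1]] - cs[bounds[:, 0]]

data = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]]

fast = window_sums(data, list_of_indices)
slow = np.array([data[lo:hi].sum() for lo, hi in list_of_indices])

# Precision check (illustrative data): a short window far into a long array.
long_data = np.full(1_000_000, 0.1)
late_window = [[999_980, 999_990]]
exact = long_data[999_980:999_990].sum()
err32 = abs(window_sums(long_data, late_window, dtype=np.float32)[0] - exact)
err64 = abs(window_sums(long_data, late_window, dtype=np.float64)[0] - exact)
print("float32 error:", err32, "float64 error:", err64)
```

The extra memory is one array of len(input) + 1 elements, independent of the number of index pairs. If the float32 error is too large for your data, switching the accumulator to float64 is the simplest remedy.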

Answer 2

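Here is a simple approach to try: keep the same solution structure you already have, which presumably works, and just make the storage creation and the indexing more efficient. If, for most index pairs, you are summing many elements from input, the summation itself should take more time than the for loop overhead. For example: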

# Put all the indices in a nice efficient structure:
idxx = np.hstack((np.array(list_of_indices, dtype=np.uint16),
    np.arange(len(list_of_indices), dtype=np.uint16)[:,None]))
# Allocate appropriate data type to the precision and range you need,
# Do it in one go to be time-efficient
output = np.zeros(len(list_of_indices), dtype=np.float32) 
for idx0, idx1, idxo in idxx:
    output[idxo] = input[idx0:idx1].sum()

If len(list_of_indices) > 2**16, use uint32 instead of uint16.
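As a rough sketch of that dtype choice, np.min_scalar_type can pick the smallest integer type that holds a given value. The helper name smallest_index_dtype is hypothetical, and it assumes all indices are non-negative; it considers both the largest bound and the largest row number, since both end up stored in the same array.

```python
import numpy as np

def smallest_index_dtype(list_of_indices):
    """Return the smallest unsigned dtype that fits every bound and row number."""
    # Assumption: all bounds are non-negative integers.
    largest = max(int(np.max(list_of_indices)), len(list_of_indices) - 1)
    return np.min_scalar_type(largest)

list_of_indices = [[0, 10], [5, 15], [45, 50]]
index_dtype = smallest_index_dtype(list_of_indices)
idx = np.array(list_of_indices, dtype=index_dtype)
```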
