Question

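First, I'd like to apologize for the poorly worded title; I couldn't think of a better way to phrase it. Basically, I want to know whether there is a faster way to implement array operations in Python where each operation depends iteratively on the previous output (e.g. forward-difference operations, filtering, etc.). The operations have a form like this: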

for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

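where X is an array of values and Y is the output. In this case, assume Y[0] is known or computed separately before the loop above. My question is: is there NumPy functionality to speed up this kind of self-referencing loop? It is the main bottleneck in almost all of my scripts. I know NumPy routines benefit from being executed as C routines, so I'm curious whether anyone knows of a NumPy routine that would help here. Otherwise, is there a better way to write this loop (in Python) that would speed up its execution for large arrays (> 500,000 data points)?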

Answer 1

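Accessing individual NumPy array elements, or iterating element by element over a NumPy array, is slow (really slow). If you ever want to iterate manually over a NumPy array: don't do it!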

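But you have some options. The easiest is to convert the arrays to Python lists and iterate over the lists (it sounds silly, but stay with me; I present some benchmarks at the end of the answer¹):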

X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

If you also iterate directly over the lists instead of indexing them, it can be even faster:

X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
    Y[idx] = X_n + X_n_m1 + Y_n_m1
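Another standard-library way to write the same recurrence is itertools.accumulate. This is not part of the benchmarks in this answer, so treat it as an untimed sketch; the function name list_accumulate and the y0 parameter are made up for illustration, and the `initial` keyword assumes Python 3.8 or newer:

```python
from itertools import accumulate

def list_accumulate(X, y0):
    """Return Y with Y[0] = y0 and Y[n] = X[n] + X[n-1] + Y[n-1]."""
    # Increment added at step n (n = 1 .. len(X)-1): X[n] + X[n-1].
    increments = [x_n + x_prev for x_n, x_prev in zip(X[1:], X)]
    # accumulate(..., initial=y0) yields y0, y0 + inc[0], y0 + inc[0] + inc[1], ...
    return list(accumulate(increments, initial=y0))

X = [1, 2, 3, 4]
Y = list_accumulate(X, y0=10)
print(Y)  # [10, 13, 18, 25]
```

For a NumPy array you would still call X.tolist() first, for the same reason as above.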

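Then there are more sophisticated options that require additional packages. The most notable are Cython and Numba, which are designed to work on the array elements directly and avoid Python overhead wherever possible. For example, with Numba you can use your approach as-is inside a jitted (just-in-time compiled) function: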

import numba as nb

@nb.njit
def func(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]

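X and Y can be NumPy arrays here, but numba works on the underlying buffer directly and outperforms the other approaches (possibly by orders of magnitude).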

Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. Without conda, however, numba is hard to install... YMMV

That said, here is a Cython version of the code as well (compiled using IPython magic; it's a bit different if you're not using IPython):

In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cimport cython
   ...:
   ...: @cython.boundscheck(False)
   ...: @cython.wraparound(False)
   ...: cpdef cython_indexing(double[:] X, double[:] Y):
   ...:     cdef Py_ssize_t n
   ...:     for n in range(1, len(X)):
   ...:         Y[n] = X[n] + X[n - 1] + Y[n-1]
   ...:     return Y

Just to give one example of the timings (based on the timing framework from my answer to another question):

import numpy as np
import numba as nb
import scipy.signal

def numpy_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_indexing(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_direct(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

@nb.njit
def numba_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y


def numpy_cumsum(X, Y):
    Y[1:] = X[1:] + X[:-1]
    np.cumsum(Y, out=Y)
    return Y

def scipy_lfilter(X, Y):
    a = [1, -1]
    b = [1, 1]
    return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)

# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()

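# Give every function its own copy of Y: they all write into Y in place and
# return it, so without copies the comparisons would compare Y with itself.
expected = numba_indexing(X, Y.copy())
np.testing.assert_array_equal(expected, numpy_indexing(X, Y.copy()))
np.testing.assert_array_equal(expected, numpy_cumsum(X, Y.copy()))
np.testing.assert_almost_equal(expected, scipy_lfilter(X, Y))
np.testing.assert_array_equal(expected, cython_indexing(X, Y.copy()))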

# Timing setup
timings = {numpy_indexing: [], 
           list_indexing: [], 
           list_direct: [],
           numba_indexing: [],
           numpy_cumsum: [],
           scipy_lfilter: [],
           cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    X = np.random.random(size=size)
    Y = np.zeros(size)
    Y[0] = np.random.random()
    for func in timings:
        res = %timeit -o func(X, Y)
        timings[func].append(res)

# Plotting absolute times

%matplotlib notebook
import matplotlib.pyplot as plt

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot(sizes, 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

# Plotting relative times

fig = plt.figure(2)  # separate figure so the relative plot does not draw over the absolute one
ax = plt.subplot(111)

baseline = numba_indexing # choose one function as baseline
for func in timings:
    ax.plot(sizes, 
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
            label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()

plt.tight_layout()
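The %timeit magic used above only exists inside IPython. If you run a plain Python script, a rough equivalent can be built on the standard-library timeit module. The sketch below is not the framework used for the plots; the helper best_time and its repeat/number defaults are assumptions for illustration, and it only times the two list-based variants on plain lists so that it needs no third-party packages:

```python
import random
import timeit

def list_indexing(X, y0):
    Y = [y0] + [0.0] * (len(X) - 1)
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n - 1]
    return Y

def list_direct(X, y0):
    Y = [y0] + [0.0] * (len(X) - 1)
    # zip reads Y lazily, so Y_n_m1 is the value written in the previous step.
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

def best_time(func, *args, repeat=3, number=5):
    """Best per-call time in seconds, similar in spirit to %timeit's `best`."""
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(totals) / number

X = [random.random() for _ in range(2**12)]
y0 = random.random()
results = {f.__name__: best_time(f, X, y0) for f in (list_indexing, list_direct)}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats follows the usual timeit advice: the slower runs mostly measure interference from other processes, not the code itself.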

This gives the following results:

[Figure: absolute runtimes]

[Figure: relative runtimes (compared to the numba function)]

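So just by converting to lists you are roughly 3 times faster! By iterating directly over these lists you get another (smaller) speedup, only 20% in this benchmark, but that makes it almost 4 times faster than the original solution. With numba you can speed it up by a factor of more than 100 compared to the list operations! Cython is only a bit slower than numba (about 40-50%), probably because I haven't squeezed out every optimization you could do with Cython (usually it's no more than 10-20% slower). For large arrays, however, the difference gets smaller.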

¹ I went into more detail in another answer. That Q+A was about converting to a set, but because set uses (hidden) "manual iteration" it applies here as well.

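I also included timings for the NumPy cumsum and SciPy lfilter approaches. Compared to the numba function, these were roughly 20 times slower for small arrays and 4 times slower for large arrays. However, if I interpret the question correctly, you were looking for a general approach, not only one that applies to the example. Not every self-referencing loop can be implemented with the cum* functions from NumPy or with SciPy's filters. And even where they apply, it seems they can't compete with Cython and/or numba.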
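To make the "not every loop" point concrete, here is a hypothetical recurrence that is not from the question: a running sum that saturates at a cap, Y[n] = min(cap, Y[n-1] + X[n]). The names clipped_running_sum and cap are made up for this sketch. Computing a plain cumulative sum first and clipping afterwards gives a different answer, because the clipping at one step changes every later value:

```python
from itertools import accumulate

def clipped_running_sum(X, y0, cap):
    """Y[0] = y0 and Y[n] = min(cap, Y[n-1] + X[n]) for n >= 1."""
    Y = [y0]
    for x in X[1:]:
        Y.append(min(cap, Y[-1] + x))
    return Y

X = [0, 3, 3, -2]
loop_result = clipped_running_sum(X, y0=0, cap=5)

# "Cumulative sum, then clip" is NOT equivalent to the loop.
naive = [min(5, s) for s in accumulate(X[1:], initial=0)]

print(loop_result)  # [0, 3, 5, 3]
print(naive)        # [0, 3, 5, 4]
```

A loop like this is where a jitted function is the natural fit, since the loop body can be written exactly as stated.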

Answer 2

This is quite simple with np.cumsum:

#!/usr/bin/env python3
import numpy as np
import random

def r():
    return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.empty(X.shape, dtype=X.dtype)
slow_Y = np.empty(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()

# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)

# original method
for n in range(1, len(X)):
    slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]


assert (fast_Y == slow_Y).all()
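Why this works, as a short derivation using the question's symbols: the recurrence says that consecutive outputs differ by X[n] + X[n-1], and summing those differences telescopes back to Y[0].

```latex
\begin{aligned}
Y[n] - Y[n-1] &= X[n] + X[n-1], \qquad n \ge 1 \\
Y[n] &= Y[0] + \sum_{k=1}^{n} \bigl( X[k] + X[k-1] \bigr)
\end{aligned}
```

The code stores Y[0] in slot 0 and the increments X[k] + X[k-1] in slots 1 and up, so the cumulative sum over that array is exactly the right-hand side.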

Answer 3

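What you describe is essentially a discrete filtering operation. This is implemented in scipy.signal.lfilter. The specific recurrence you describe corresponds to a = [1, -1] and b = [1, 1].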

import numpy as np
import scipy.signal

a = [1, -1]
b = [1, 1]

X = np.random.random(10000)
Y = np.zeros(10000)

newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
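To see where a, b and the offset Y[0] - X[0] come from, it may help to write out the difference equation that lfilter is documented to evaluate, a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... - a[1]*y[n-1] - ..., starting from a zero initial state. The pure-Python function below is only an illustrative sketch of that equation (the name difference_equation is made up, and it is far slower than lfilter):

```python
def difference_equation(b, a, x):
    """Evaluate a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

X = [1.0, 2.0, 3.0, 4.0]
y0 = 10.0

# With b = [1, 1], a = [1, -1] this is y[n] = x[n] + x[n-1] + y[n-1], but it
# starts from y[0] = x[0] instead of the desired Y[0].
filtered = difference_equation([1, 1], [1, -1], X)

# The recurrence carries a constant offset through unchanged, so adding
# (Y[0] - X[0]) to every element moves the start value to Y[0].
shifted = [v + (y0 - X[0]) for v in filtered]

print(filtered)  # [1.0, 4.0, 9.0, 16.0]
print(shifted)   # [10.0, 13.0, 18.0, 25.0]
```

The shifted values match what the original loop produces for the same X and Y[0] = 10.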

On my computer, the timings are as follows:

%timeit func4(X, Y.copy())
# 100000 loops, best of 3: 14.6 µs per loop

%timeit newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
# 10000 loops, best of 3: 68.1 µs per loop

Best way to iterate over an array that uses its own output

3 answers:
