Question

I have the following function in pure Python:

import numpy as np

def subtractPython(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount, yAxisCount, xAxisCount)
    results = np.zeros(shape)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results

I tried to cythonize it like this:

import numpy as np
cimport numpy as np

DTYPE = np.int
ctypedef np.int_t DTYPE_t

def subtractPython(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef int xAxisCount = a.shape[0]
    cdef int yAxisCount = a.shape[1]

    cdef np.ndarray[DTYPE_t, ndim=3] results = np.zeros([xAxisCount, yAxisCount, xAxisCount], dtype=DTYPE)

    cdef int lenB = len(b)

    cdef np.ndarray[DTYPE_t, ndim=2] subtracted
    for index in range(lenB):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results

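However, I don't see any speed-up. Is there something I'm missing, or can this process simply not be sped up?

Edit -> I have realized that I wasn't actually cythonizing the subtraction algorithm in the code above. I have now managed to cythonize it, but it has exactly the same runtime as a - b[:, None], so I guess this is the maximum speed for this operation.

This is basically a - b[:, None] -> it has the same runtime: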

%%cython

import numpy as np
cimport numpy as np


DTYPE = np.int
ctypedef np.int_t DTYPE_t

cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def subtract(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef np.ndarray[DTYPE_t, ndim=3] result = np.zeros([b.shape[0], a.shape[0], a.shape[1]], dtype=DTYPE)

    cdef int lenB = b.shape[0]
    cdef int lenA = a.shape[0]
    cdef int lenColB = b.shape[1]

    cdef int rowA, rowB, column

    for rowB in range(lenB):
        for rowA in range(lenA):
            for column in range(lenColB):
                result[rowB, rowA, column] = a[rowA, column] - b[rowB, column]
    return result
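
As a small check of the claim that the triple loop matches a - b[:, None], here is a minimal sketch. It uses a pure-Python stand-in for the Cython loop (the name `subtract_loops` and the small array sizes are assumptions chosen for illustration) and compares it with the broadcast expression:

```python
import numpy as np

def subtract_loops(a, b):
    # Pure-Python stand-in for the triple loop above, for checking only (slow).
    result = np.empty((b.shape[0], a.shape[0], a.shape[1]), dtype=a.dtype)
    for row_b in range(b.shape[0]):
        for row_a in range(a.shape[0]):
            for column in range(b.shape[1]):
                result[row_b, row_a, column] = a[row_a, column] - b[row_b, column]
    return result

rng = np.random.default_rng(0)
a = rng.integers(1, 1000, size=(4, 5))
b = rng.integers(1, 1000, size=(3, 5))

# b[:, None] has shape (3, 1, 5); broadcasting it against a (4, 5)
# yields shape (3, 4, 5), indexed as [row of b, row of a, column].
broadcast = a - b[:, None]
loops = subtract_loops(a, b)
print(broadcast.shape, np.array_equal(broadcast, loops))
```

Both versions read the same inputs and write the same output elements, which is consistent with them having similar runtimes.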

Answer 1

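When trying to optimize a function, you should always know what its bottleneck is; without that, you will spend a lot of time running in the wrong direction.

Let's use your Python function as the baseline (I actually use results = np.zeros(shape, dtype=a.dtype), because otherwise your method returns floats, which is probably a bug):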

>>> import numpy as np
>>> a=np.random.randint(1,1000,(300,300), dtype=int)  # np.int was an alias of the builtin int and is gone from recent NumPy
>>> b=np.random.randint(1,1000,(300,300), dtype=int)
>>> %timeit subtractPython(a,b)
274 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

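The first question we should ask ourselves is: is this task memory-bound or CPU-bound? Clearly it is memory-bound: the subtraction itself is negligible compared with the memory reads and writes it requires.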
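
A rough back-of-the-envelope calculation illustrates the point. This sketch assumes 8-byte integers (on some platforms the default integer is 4 bytes, which would halve the figures):

```python
import numpy as np

n_rows, n_cols = 300, 300
itemsize = np.dtype(np.int64).itemsize   # 8 bytes; an assumption about the platform

n_elements = n_rows * n_rows * n_cols    # one (300, 300) slab per row of b
result_bytes = n_elements * itemsize     # memory that has to be written
input_bytes = 2 * n_rows * n_cols * itemsize  # a and b together

print(f"{n_elements:,} subtractions, {result_bytes / 1e6:.0f} MB written, "
      f"{input_bytes / 1e6:.2f} MB of input")
```

Each output element costs a single integer subtraction but also an 8-byte write, so the memory traffic to the result array dominates.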

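That means we have to optimize the memory layout in order to reduce cache misses. As a rule of thumb, our memory accesses should touch consecutive memory addresses, one after another.

Is that the case here? No. The array results is in C order, i.e. row-major order, so the access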

results[:, :, index] = subtracted

is not contiguous. On the other hand,

results[index, :, :] = subtracted

would be a contiguous access. Let's change the way the information is stored in results:

def subtract1(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount,  xAxisCount, yAxisCount) #<=== Change order
    results = np.zeros(shape, dtype=a.dtype)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[index, :, :] = subtracted   #<===== consecutive access
    return results

The timings are now:

>>> %timeit subtract1(a,b)
35.8 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
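
The difference between the two access patterns can be inspected directly through the flags and strides of the slices. This is a small sketch that assumes 8-byte integers and the default C-order allocation:

```python
import numpy as np

results = np.zeros((300, 300, 300), dtype=np.int64)  # C order by default

last_axis_slice = results[:, :, 0]    # the region results[:, :, index] writes to
first_axis_slice = results[0, :, :]   # the region results[index, :, :] writes to

# Strides are the byte steps between neighbouring elements along each axis.
print(results.strides)
print(last_axis_slice.flags['C_CONTIGUOUS'], last_axis_slice.strides)
print(first_axis_slice.flags['C_CONTIGUOUS'], first_axis_slice.strides)
```

In the last-axis slice, neighbouring elements are 2400 bytes apart, so every write lands far from the previous one; the first-axis slice is one dense block of 300 * 300 * 8 bytes.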

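There are two more small improvements: we don't have to initialize the result with zeros, and we can save some Python overhead. Together, however, they only give us about 5%: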

def subtract2(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount,  xAxisCount, yAxisCount) 
    results = np.empty(shape, dtype=a.dtype)        #<=== no need for zeros
    for index in range(len(b)):
        results[index, :, :] = (a-b[index])   #<===== less python overhead
    return results

>>> %timeit subtract2(a,b)
34.5 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

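This is now about 8 times faster than the original version.

You can use Cython to try to speed this up further, but the task is probably still memory-bound, so don't expect it to become significantly faster; after all, Cython cannot make memory work faster. However, without proper profiling it is hard to say how much improvement is possible, and I wouldn't be surprised if someone came up with a faster version.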
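
One variant that might be worth measuring, offered only as a sketch and not as part of the answer above, is to let NumPy write each difference straight into its slab through the `out` argument of `np.subtract`, which avoids creating a temporary array and copying it. The function name `subtract_out` is hypothetical, and the small arrays are only there to check the result:

```python
import numpy as np

def subtract_out(a, b):
    # Hypothetical variant: same layout as subtract2, but each difference
    # is written directly into results[index] instead of a temporary.
    results = np.empty((len(b), a.shape[0], a.shape[1]), dtype=a.dtype)
    for index in range(len(b)):
        np.subtract(a, b[index], out=results[index])
    return results

rng = np.random.default_rng(0)
a = rng.integers(1, 1000, size=(6, 6))
b = rng.integers(1, 1000, size=(6, 6))

expected = a - b[:, None]          # broadcast reference result
print(np.array_equal(subtract_out(a, b), expected))
```

Whether this is actually faster depends on the machine and the array sizes, so it should be timed with %timeit on the real data before drawing any conclusion.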

Cythonization of a numpy function
