Unexpectedly slow Cython convolution code

Date: 2014-05-14 00:17:42

Tags: python performance cython convolution

I need to quickly compute a matrix whose entries are obtained by convolving a filter with a vector for each row, subsampling the entries of the resulting vector, and then taking the dot product of the result with another vector. Specifically, I want to compute

M = [conv(e_j, f) * P_i * v_i]_{i,j},

where i ranges from 1 to n and j ranges from 1 to m. Here e_j is an indicator (row) vector of size n with a one only in column j, f is a filter of length s, P_i is an (n+s-1)-by-k matrix that picks the appropriate k entries out of the convolution, and v_i is a column vector of length k.

Computing each entry of M takes O(n*s) operations, so computing M takes O(n*s*n*m) overall. For n=6, m=7, s=3 that is roughly 6·3·6·7 ≈ 756 floating-point operations, so one core of my computer (8 GFLOPS) should be able to compute M in about 0.094 microseconds. However, following the example given in the Cython documentation, my very simple Cython implementation takes more than 2 milliseconds to compute an example with these parameters. That's a difference of about 4 orders of magnitude!
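For reference, here is a plain NumPy restatement of the definition above (just the definition spelled out for checking results; it is not part of the shar file below, and the name calcM_reference is mine):

import numpy as np
from scipy.signal import convolve

def calcM_reference(filtertaps, n, m, keep_indices, V):
    # Direct restatement of M_{i,j} = conv(e_j, f)^T * P_i * v_i
    M = np.zeros((n, m), dtype=np.float32)
    for j in range(m):
        ej = np.zeros(m, dtype=np.float32)
        ej[j] = 1.0                                   # indicator vector e_j
        full = convolve(ej, filtertaps, mode='full')  # full convolution of e_j with the filter
        for i in range(n):
            # P_i picks the k entries listed in row i of keep_indices;
            # dot the result with v_i (the i-th row of V)
            M[i, j] = np.dot(full[keep_indices[i]], V[i])
    return M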

Here is a shar file containing the Cython implementation and test code. Copy and paste it into a file and run 'bash <fname>' in a clean directory to unpack the code, then run 'bash ./test.sh' to see the dismal performance.

cat > fastcalcM.pyx <<'EOF'

import numpy as np
cimport numpy as np
cimport cython
from scipy.signal import convolve

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

@cython.boundscheck(False)
def calcM(np.ndarray[DTYPE_t, ndim=1, negative_indices=False] filtertaps,
          int n, int m,
          np.ndarray[np.int_t, ndim=2, negative_indices=False] keep_indices,
          np.ndarray[DTYPE_t, ndim=2, negative_indices=False] V):
    """ Computes a numrows-by-k matrix M whose entries satisfy
        M_{i,k} = [conv(e_j, f)^T * P_i * v_i],
        where v_i^T is the i-th row of V, and P_i samples the entries from
        conv(e_j, f)^T indicated by the ith row of the keep_indices matrix """

    cdef int k = keep_indices.shape[1]

    cdef np.ndarray M = np.zeros((n, m), dtype=DTYPE)
    cdef np.ndarray ej = np.zeros((m,), dtype=DTYPE)
    cdef np.ndarray convolution
    cdef int rowidx, colidx, kidx

    for rowidx in range(n):
        for colidx in range(m):
            ej[colidx] = 1
            convolution = convolve(ej, filtertaps, mode='full')
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            ej[colidx] = 0

    return M

EOF
#-----------------------------------------------------------------------------
cat > test_calcM.py << 'EOF'

import numpy as np
from fastcalcM import calcM

filtertaps = np.array([-1, 2, -1]).astype(np.float32)
n, m = 6, 7
keep_indices = np.array([[1, 3], 
                         [4, 5],
                         [2, 2], 
                         [5, 5], 
                         [3, 4], 
                         [4, 5]]).astype(np.int)
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32)

print calcM(filtertaps, n, m, keep_indices, V)

EOF
#-----------------------------------------------------------------------------
cat > test.sh << 'EOF'

python setup.py build_ext --inplace
echo -e "%run test_calcM\n%timeit calcM(filtertaps, n, m, keep_indices, V)" > script.ipy
ipython script.ipy

EOF
#-----------------------------------------------------------------------------
cat > setup.py << 'EOF'

from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    name="Fast convolutions",
    include_dirs = [numpy.get_include()],
    ext_modules = cythonize("fastcalcM.pyx")
)

EOF

I thought that maybe calling scipy's convolve might be the culprit (I'm not sure Cython and scipy always play well together), so I implemented my own convolution code, following the same example in the Cython documentation, but this resulted in the overall code being about 10 times slower.

Any ideas on how to get closer to the theoretically possible speed, or reasons why the difference is so large?

1 Answer:

Answer 0 (score: 4)

First of all, the inputs to convolve don't allow fast indexing. Really, the typing you're doing isn't particularly helpful.

But that doesn't matter, because you have two overheads. The first is converting between Cython and Python types. You should keep the arrays untyped if you are passing them to Python a lot, to prevent that conversion. Just moving this to Python sped things up for that reason (1 ms → 0.65 μs).
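For concreteness, the untyped version is just the same function with the type declarations stripped out; it matches the source lines shown in the profile below:

import numpy as np
from scipy.signal import convolve

def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    for rowidx in range(n):
        for colidx in range(m):
            ej[colidx] = 1
            convolution = convolve(ej, filtertaps, mode='full')
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            ej[colidx] = 0

    return M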

Then I profiled it:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           def calcM(filtertaps, n, m, keep_indices, V):
    16      4111         3615      0.9      0.1      k = keep_indices.shape[1]
    17      4111         8024      2.0      0.1      M = np.zeros((n, m), dtype=np.float32)
    18      4111         6090      1.5      0.1      ej = np.zeros((m,), dtype=np.float32)
    19                                           
    20     28777        18690      0.6      0.3      for rowidx in range(n):
    21    197328       123284      0.6      2.2          for colidx in range(m):
    22    172662       112348      0.7      2.0              ej[colidx] = 1
    23    172662      4076225     23.6     73.6              convolution = convolve(ej, filtertaps, mode='full')
    24    517986       395513      0.8      7.1              for kidx in range(k):
    25    345324       668309      1.9     12.1                  M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
    26    172662       120271      0.7      2.2              ej[colidx] = 0
    27                                           
    28      4111         2374      0.6      0.0      return M

Before you consider anything else, deal with convolve.

Why is convolve slow? Well, it has a lot of overhead. That's par for the course with numpy/scipy; they are best suited to large datasets. If you know your array sizes are going to stay small, just reimplement convolve in Cython.

Oh, and try to use the memoryview syntax: DTYPE_t[:, :] for a 2D array, DTYPE_t[:] for a 1D array, and so on. That's the memoryview protocol, and it's better. In some cases it has a bit more overhead, but that can usually be worked around, and it's nicer in most other respects.
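As a minimal, untested sketch of both suggestions (my own illustration, assuming float32 data; it is not code from the question): a direct 'full'-mode convolution over typed memoryviews, which avoids scipy's per-call overhead for tiny arrays:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void convolve_full(float[:] x, float[:] h, float[:] out):
    # out must be zero-initialised and have length x.shape[0] + h.shape[0] - 1
    cdef int i, j
    for i in range(x.shape[0]):
        for j in range(h.shape[0]):
            out[i + j] += x[i] * h[j]

calcM could then call this on a preallocated output buffer in the inner loop instead of scipy's convolve.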


Edit:

You can try inlining the scipy function (recursively working through its internal calls):

import numpy as np
from scipy.signal.sigtools import _correlateND

def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            in1zpadded[colidx] = 1

            convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]

            in1zpadded[colidx] = 0

    return M

Note that this uses private implementation details.
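Since this leans on a private scipy function, it's worth checking the result against the original implementation. A minimal check, assuming the inlined version above has been saved in a module named inlined (a made-up name) next to the question's fastcalcM:

import numpy as np
from fastcalcM import calcM                  # the question's Cython version
from inlined import calcM as calcM_inlined   # hypothetical module holding the code above

filtertaps = np.array([-1, 2, -1]).astype(np.float32)
n, m = 6, 7
keep_indices = np.array([[1, 3], [4, 5], [2, 2], [5, 5], [3, 4], [4, 5]]).astype(np.int)
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32)

assert np.allclose(calcM(filtertaps, n, m, keep_indices, V),
                   calcM_inlined(filtertaps, n, m, keep_indices, V))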

This is tailored to your particular sizes, so I don't know whether it carries over to your real data, but it removes the vast majority of the overhead. You can then improve on it by typing things again:

import numpy as np
cimport numpy as np
from scipy.signal.sigtools import _correlateND

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V):
    cdef int rowidx, colidx, kidx, k
    cdef DTYPE_t[:, :] t_M
    cdef DTYPE_t[:] t_in1zpadded, t_convolution

    k = t_keep_indices.shape[1]
    t_M = M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            t_in1zpadded[colidx] = 1

            t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx]

            t_in1zpadded[colidx] = 0

    return M

That's more than 10 times faster again, though not as fast as your pie-in-the-sky estimate. Then again, that estimate was a bit bogus to begin with ;).