Cython: slow numpy arrays

Time: 2017-02-20 16:26:07

Tags: arrays numpy cython

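I am trying to speed up my code using Cython. After translating the code from Python to Cython, I see no speed-up at all. I think the root of the problem is the poor performance I get when using numpy arrays in Cython.

I have come up with a very simple program that shows this: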

############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython

def func1(long N):

    cdef double sum1,sum2,sum3
    cdef long i

    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

def func2(long N):

    cdef np.ndarray[np.float64_t,ndim=1] sum_arr
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

def func3(long N):

    cdef double sum_arr[3]
    cdef long i

    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr
##########################################

################## test.py ###############
import time
import test as test

N = 1000000000

for i in xrange(10):

    start = time.time()
    sum1,sum2,sum3 = test.func1(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func2(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func3(N)
    print 'Time taken = %.3f'%(time.time()-start)
############################################

Running python test.py, I get:

Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418

Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595

Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418

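My question is: why is func2 almost 2x slower than func1 and func3?

Is there a way to improve this?

Thanks!

######## UPDATE

My real problem is the following. I am calling a function that takes a 3D array (say P[i,j,k]). The function loops over every element and computes several quantities: one that depends on the value of the array at that position (say A = f(P[i,j,k])) and another that depends only on the position within the array (B = g(i,j,k)). Schematically, it looks like this: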

for i in xrange(N):
    corr1 = h(i,val)

    for j in xrange(N):
        corr2 = h(j,val)

        for k in xrange(N):
            corr3 = h(k,val)

            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3

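where val is a property of the 3D array, represented by a number. This number can differ from one field to another.

Since I have to do this over many 3D arrays, I thought it would be better to write a new routine that accepts many different input 3D arrays, whose number is not known a priori. The idea is that since B is exactly the same for all arrays, I can avoid computing it for each array and compute it only once. The problem is that corr1, corr2 and corr3 above then become arrays:

If I have a number of 3D arrays equal to num_3D_arrays, what I am doing is: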

for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])

    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])

        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])

            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[i,j,k])
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]

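So changing the variables corr1, corr2, corr3 and A from scalars to arrays is killing the performance gain I expected from not having to repeat the big loop.

3 Answers:

Answer 0: (score: 2)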

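There are a few things you can do to speed up array indexing in Cython:

  • Turn off bounds checking (@cython.boundscheck(False)).
  • Turn off negative-index wraparound (@cython.wraparound(False)).
  • Use a typed memoryview and declare the array contiguous (np.float64_t[::1]).

So, for your function: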

@cython.boundscheck(False)
@cython.wraparound(False)
def func2(long N):

    cdef np.float64_t[::1] sum_arr
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

For the original code, Cython generated the following C code for the line sum_arr[0] += i:

__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
  __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
  if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
  __Pyx_RaiseBufferIndexError(__pyx_t_6);
  {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

With the improvements above:

__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
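
To make the difference between the two C listings concrete, here is a rough plain-Python mirror of the per-access work. The helper names checked_get and unchecked_get are made up for this illustration and are not part of Cython. It only sketches which branches the two directives remove; it is not a benchmark.

```python
def checked_get(buf, idx):
    """Mirror of the access emitted with boundscheck and wraparound enabled."""
    n = len(buf)
    if idx < 0:
        # wraparound: a negative index counts from the end of the buffer
        idx += n
        if idx < 0:
            raise IndexError("index out of bounds on buffer access (axis 0)")
    elif idx >= n:
        raise IndexError("index out of bounds on buffer access (axis 0)")
    return buf[idx]


def unchecked_get(buf, idx):
    """Mirror of the access emitted with both checks disabled.

    The generated C is a single pointer offset with no branches. Python
    itself still validates the index here, so only the shape of the code
    is comparable, not its safety or speed.
    """
    return buf[idx]
```

With the checks on, every one of the 3*N accesses in the loop pays for these comparisons; with them off, an out-of-range index in the compiled code is no longer caught, so the directives are only safe when the indices are known to be valid.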

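Answer 1: (score: 0)

  • Why is func2 almost 2x slower than func1?

    Because indexing introduces an indirection, so you roughly double the number of elementary operations. Compute the sums in scalars as in func1, then build the array at the end with sum=array([sum1,sum2,sum3]).

  • How to speed up Python code?

    1. NumPy is the first good idea; it gets you close to C speed with little effort.

    2. Numba can fill the gaps, also with little effort, and is very simple to use.

    3. Cython for the critical cases.

Here are some examples: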

# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

# numpy way
from numpy import arange

def func2(N):
    aran = arange(float(N))
    sum1 = aran.sum()
    sum2 = (2.0*aran).sum()
    sum3 = (3.0*aran).sum()
    return sum1,sum2,sum3

# numba way
import numba
func3 = numba.njit(func1)

"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop

In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop

In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
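
The suggestion in the first bullet (accumulate in scalars, pack the result once at the end) can be sketched in plain Python as follows. The function names sums_indexed and sums_local are made up for this illustration, and a list stands in for the NumPy array so the sketch has no third-party dependency.

```python
def sums_indexed(n):
    """Accumulate directly into a container: one index operation per update."""
    acc = [0.0, 0.0, 0.0]
    for i in range(n):
        acc[0] += i
        acc[1] += 2.0 * i
        acc[2] += 3.0 * i
    return acc


def sums_local(n):
    """Accumulate in local scalars and build the container once at the end."""
    s1 = s2 = s3 = 0.0
    for i in range(n):
        s1 += i
        s2 += 2.0 * i
        s3 += 3.0 * i
    # With NumPy, this last step would be array([s1, s2, s3]) as in the answer.
    return [s1, s2, s3]
```

Both functions return the same values; the second keeps the container out of the hot loop entirely, which is the pattern func1 in the question already follows.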

Answer 2: (score: 0)

Look at the HTML generated by cython -a ...pyx

For func1, the sum1 += i line expands to:

+15:         sum1 += i
    __pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);

For func3, with a C array:

+45:         sum_arr[0] += i
    __pyx_t_3 = 0;
    (__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);

Slightly more complicated, but still straightforward C.

And for func2:

+29:         sum_arr[0] += i
    __pyx_t_12 = 0;
    __pyx_t_6 = -1;
    if (__pyx_t_12 < 0) {
      __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
      if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
    } else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
    if (unlikely(__pyx_t_6 != -1)) {
      __Pyx_RaiseBufferIndexError(__pyx_t_6);
      __PYX_ERR(0, 29, __pyx_L1_error)
    }
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

That is much more complicated, with references to numpy buffer helpers (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:

+26:     sum_arr = np.zeros(3,dtype=np.float64)
  __pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
   ....

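I expect that moving the creation of sum_arr out to the calling Python code, and passing it to func2 as an argument, would save some time.

Have you read this guide on using memoryviews?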

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

You will get the best Cython performance if you focus on writing low-level operations, so that they translate into simple C. In

    for k in xrange(N):
        corr3 = h(k,val)

        A = f(P[i,j,k])
        B = g(i,j,k)
        Arr[B] += A*corr1*corr2*corr3

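the loops over i, j, k are not what slows you down. It is evaluating h, f and g every time, as well as Arr[B] += .... These functions should be tightly coded Cython, not general Python functions. Look at how simple the compiled code is for the sum3d function in the memoryview guide.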
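
One possible way to reduce the number of h evaluations in the multi-array loop from the question is to tabulate h(index, val[p]) once, since it depends only on the index and on p. The sketch below is plain Python under several assumptions: the bodies of h, f and g are hypothetical stand-ins (the question does not give them), NBINS is an invented bin count, and P is assumed to hold one 3D array per p, indexed as P[p][i][j][k].

```python
NBINS = 4  # assumed number of output bins, chosen for this illustration


def h(idx, val):
    # hypothetical correction factor
    return 1.0 + val * idx


def f(x):
    # hypothetical function of the array value
    return 2.0 * x


def g(i, j, k):
    # hypothetical bin index that depends on the position only
    return (i + j + k) % NBINS


def accumulate_naive(P, vals, n):
    """Call h three times per element and per array, inside the inner loop."""
    num = len(P)
    out = [[0.0] * NBINS for _ in range(num)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                b = g(i, j, k)  # computed once, shared by all arrays
                for p in range(num):
                    v = vals[p]
                    out[p][b] += f(P[p][i][j][k]) * h(i, v) * h(j, v) * h(k, v)
    return out


def accumulate_tabulated(P, vals, n):
    """Tabulate h once (num * n calls), then only look values up in the loop."""
    num = len(P)
    corr = [[h(idx, v) for idx in range(n)] for v in vals]  # corr[p][idx]
    out = [[0.0] * NBINS for _ in range(num)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                b = g(i, j, k)
                for p in range(num):
                    c = corr[p]
                    out[p][b] += f(P[p][i][j][k]) * c[i] * c[j] * c[k]
    return out
```

In Cython the table would presumably be a typed, contiguous 2D memoryview, so the inner loop reduces to plain C array reads; whether this helps in practice depends on how expensive the real h is.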