Why is my Cython C function 40x slower than the builtin it wraps?

Asked: 2015-05-27 00:12:21

Tags: performance cython

I'm new to Cython. Why is my C function Numeraire, which at this point does nothing but wrap a builtin, so much slower than calling the builtin directly?

Thanks. Here is the Cython code (backward.pyx):

import numpy as np
cimport numpy as np

from libc.math cimport exp

cdef double Numeraire(int i1, int i0, np.ndarray[np.int_t, ndim=1] j):
    cdef float rate = 0.05
    return exp(-rate/12*(i1 - i0))

def Slow(np.ndarray[np.float_t, ndim=2] values, int i1, int i0):
    cdef float norm = 0.25
    cdef int i, j0, j1
    cdef np.ndarray[np.int_t, ndim=1] j = np.empty(2, dtype=np.int)
    for i in range(i1-1, i0-1, -1):
        for j0 in range(i+1):
            j[0] = j0
            for j1 in range(i+1):
                j[1] = j1
                values[j0, j1] += (
                    values[j0+1, j1  ] +
                    values[j0  , j1+1] +
                    values[j0+1, j1+1])
                values[j0, j1] *= norm*Numeraire(i+1, i, j)      #4.397s (!)

def Fast(np.ndarray[np.float_t, ndim=2] values, int i1, int i0):
    cdef float norm = 0.25
    cdef int i, j0, j1
    cdef np.ndarray[np.int_t, ndim=1] j = np.empty(2, dtype=np.int)
    for i in range(i1-1, i0-1, -1):
        for j0 in range(i+1):
            j[0] = j0
            for j1 in range(i+1):
                j[1] = j1
                values[j0, j1] += (
                    values[j0+1, j1  ] +
                    values[j0  , j1+1] +
                    values[j0+1, j1+1])
                values[j0, j1] *= norm*exp(-0.05/12*((i+1) - i)) #0.327s

Here is the timing information:

In [1]: import numpy as np
In [2]: import backward
In [3]: factors=2
In [4]: i=360
In [5]: %timeit backward.Fast(np.ones([i+1]*factors), i, 0)
10 loops, best of 3: 104 ms per loop
In [6]: %timeit backward.Slow(np.ones([i+1]*factors), i, 0)
1 loops, best of 3: 4.67 s per loop

1 Answer:

Answer 0 (score: 2)

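It is related to the ndarray that you pass to Numeraire without using it. If you run cython -a backward.pyx and look at the annotated output, the first thing you see is that the cdef double Numeraire... line is highlighted pale yellow (showing that Cython is doing hidden work there), and when you click on that line you get the following code: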

static double __pyx_f_8backward_Numeraire(int __pyx_v_i1, int __pyx_v_i0, CYTHON_UNUSED PyArrayObject *__pyx_v_j) {
  float __pyx_v_rate;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_j;
  __Pyx_Buffer __pyx_pybuffer_j;
  double __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("Numeraire", 0);
  __pyx_pybuffer_j.pybuffer.buf = NULL;
  __pyx_pybuffer_j.refcount = 0;
  __pyx_pybuffernd_j.data = NULL;
  __pyx_pybuffernd_j.rcbuffer = &__pyx_pybuffer_j;
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_j.rcbuffer->pybuffer, (PyObject*)__pyx_v_j, &__Pyx_TypeInfo_nn___pyx_t_5numpy_int_t, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 9; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  }
  __pyx_pybuffernd_j.diminfo[0].strides = __pyx_pybuffernd_j.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_j.diminfo[0].shape = __pyx_pybuffernd_j.rcbuffer->pybuffer.shape[0];
/* … */
  /* function exit code */
  __pyx_L1_error:;
  { PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
    __Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_j.rcbuffer->pybuffer);
  __Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
  __Pyx_WriteUnraisable("backward.Numeraire", __pyx_clineno, __pyx_lineno, __pyx_filename, 0);
  __pyx_r = 0;
  goto __pyx_L2;
  __pyx_L0:;
  __pyx_L2:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

The main body of the function sits at the bit marked /* … */.

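Some of this work applies to every Cython call, but a fair chunk of it is related to the ndarray that you don't use, j (for example, everything involving __pyx_pybuffer_j and __pyx_pybuffernd_j). The buffer is acquired and validated on every call, even though the function body never reads it.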
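To put that cost in perspective, here is a rough Python sketch (not part of the original answer) that counts how often the inner loop body runs for the timed call with i1=360 and i0=0, then divides the measured time gap by that count. It assumes the whole difference between the two %timeit results comes from calling Numeraire; the variable names are illustrative.

```python
# Back-of-the-envelope estimate of the per-call overhead, using the
# %timeit figures from the question (assumption: the whole time gap
# between Slow and Fast comes from calling Numeraire).
i1, i0 = 360, 0

# The loops run i from i1-1 down to i0, with (i+1)**2 inner iterations each.
calls = sum((i + 1) ** 2 for i in range(i1 - 1, i0 - 1, -1))

slow_s = 4.67    # backward.Slow, seconds per loop
fast_s = 0.104   # backward.Fast, seconds per loop

slowdown = slow_s / fast_s
overhead_ns = (slow_s - fast_s) / calls * 1e9

print(f"{calls} calls, {slowdown:.0f}x slower, ~{overhead_ns:.0f} ns extra per call")
```

That works out to roughly 15.6 million calls and on the order of 290 ns of extra work per call, which is plausible for acquiring, validating and releasing a buffer each time.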

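If you remove j from the argument list, the speed is the same with and without the function call. If you really do need j in the non-trivial, non-example version of this function, there are a few options.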

  1. If you always know that j will have length 2, you could just have

    cdef double Numeraire(int i1, int i0, double j0, double j1):

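  2. Alternatively, you could pass a C-style double*, a length, and possibly a stride (although if you declare j as cdef ndarray[...,mode="c"] you don't need the stride). That would probably be faster.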

  3. Best option: the easiest choice is to use the new-style Cython typed memoryview interface instead of the ndarray interface. Code:

    cdef double Numeraire(int i1, int i0, long[::1] j):
      # code as before
    
    # then within your calling function
      # ...
      cdef long[::1] j = np.empty(2, dtype=np.int)
      # ...
    

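    In this case that seems to be almost free of overhead (however, in some other cases I have found the memoryview interface to be a fraction (~1%) slower, so it is not always the best answer).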
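Differences as small as ~1% are easy to lose in noise, so it helps to compare best-of-N timings rather than single runs. The following is a minimal Python sketch of such a comparison, not something from the original answer: best_time and relative_slowdown are hypothetical helper names, and the two workloads are stand-ins so the snippet runs without the compiled module.

```python
import timeit


def best_time(func, repeat=3, number=1):
    """Best (smallest) time per call in seconds, like %timeit's 'best of 3'."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def relative_slowdown(candidate, baseline, repeat=3, number=1):
    """Ratio of best times; 1.01 means the candidate is about 1% slower."""
    return best_time(candidate, repeat, number) / best_time(baseline, repeat, number)


# Stand-in workloads (assumption: replace these with calls into the
# compiled module when measuring the real functions).
def baseline_work():
    return sum(range(20000))


def candidate_work():
    return sum(range(40000))


ratio = relative_slowdown(candidate_work, baseline_work, repeat=5, number=20)
print(f"candidate takes {ratio:.2f}x the baseline time")
```

With the compiled module available, the stand-ins could be replaced by something like lambda: backward.Slow(np.ones((361, 361)), 360, 0) and lambda: backward.Fast(np.ones((361, 361)), 360, 0), which mirrors the %timeit calls in the question.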