Question

I have some Cython code that performs extremely repetitive pixel-wise operations on NumPy arrays (representing BGR images), of the following form:

ctypedef double (*blend_type)(double, double) # function pointer
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef cnp.ndarray[cnp.float_t, ndim=3] blend_it(const double[:, :, :] array_1, const double[:, :, :] array_2, const blend_type blendfunc, const double opacity):
  # the base layer is a (array_1)
  # the blend layer is b (array_2)
  # base layer is below blend layer
  cdef Py_ssize_t y_len = array_1.shape[0]
  cdef Py_ssize_t x_len = array_1.shape[1]
  cdef Py_ssize_t a_channels = array_1.shape[2]
  cdef Py_ssize_t b_channels = array_2.shape[2]
  cdef cnp.ndarray[cnp.float_t, ndim=3] result = np.zeros((y_len, x_len, a_channels), dtype = np.float_)
  cdef double[:, :, :] result_view = result
  cdef Py_ssize_t x, y, c

  for y in range(y_len):
    for x in range(x_len):
      for c in range(3): # iterate over BGR channels first
        # calculate channel values via blend mode
        a = array_1[y, x, c]
        b = array_2[y, x, c]
        result_view[y, x, c] = blendfunc(a, b)
        # many other operations involving result_view...
  return result;

where blendfunc refers to another Cython function, such as the following overlay_pix:

cdef double overlay_pix(double a, double b):
  if a < 0.5:
    return 2*a*b
  else:
    return 1 - 2*(1 - a)*(1 - b)

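The point of using a function pointer is to avoid rewriting this large block of repetitive code over and over for each blend mode (and there are plenty of them). So I created an interface like this for each blend mode, which saves me the trouble: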

def overlay(double[:, :, :] array_1, double[:, :, :] array_2, double opacity = 1.0):
  return blend_it(array_1, array_2, overlay_pix, opacity)

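However, this seems to cost me some time! I notice that for very large images (8K images and larger), a substantial amount of time is lost when blend_it calls blendfunc through the pointer instead of calling overlay_pix directly. I assume this is because blend_it has to dereference the function pointer on every iteration instead of having the function immediately available, but I am not sure.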

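The time loss is not ideal, but I certainly don't want to rewrite blend_it again and again for each blend mode. Is there any way to avoid the time loss? Is there a way to turn the function pointer into a local function outside the loop and then access it faster inside the loop?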

Answer 1

@ead's answer makes two interesting points. One of them:

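In C++ you would use templates instead. This is definitely true, and because template types are always known at compile time, optimization is usually easy.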

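Cython and C++ templates are a bit of a mess together, so I don't think you want to use them here. However, Cython does have a template-like feature called fused types. You can use fused types to get the compile-time optimization, as demonstrated below. The rough outline of the code is: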

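Define a cdef class containing a staticmethod cdef function for each operation you want to perform.
Define a fused type containing all the cdef classes. (This is the limitation of this approach: it is not easily extensible, so if you want to add an operation you have to edit the code.)
Define a function that takes a dummy argument of your fused type. Use this type to call the staticmethod.
Define wrapper functions. You need to use the explicit [type] syntax to get this to work.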

Code:

import cython

cdef class Plus:
    @staticmethod
    cdef double func(double x):
        return x+1    

cdef class Minus:
    @staticmethod
    cdef double func(double x):
        return x-1

ctypedef fused pick_func:
    Plus
    Minus

cdef run_func(double [::1] x, pick_func dummy):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = cython.typeof(dummy).func(x[i])
    return x.base

def run_func_plus(x):
    return run_func[Plus](x,Plus())

def run_func_minus(x):
    return run_func[Minus](x,Minus())

For comparison, the equivalent code using function pointers is

cdef double add_one(double x):
    return x+1

cdef double minus_one(double x):
    return x-1

cdef run_func_ptr(double [::1] x, double (*f)(double)):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_func_ptr_plus(x):
    return run_func_ptr(x,add_one)

def run_func_ptr_minus(x):
    return run_func_ptr(x,minus_one)

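Using timeit, I get about a 2.5x speed-up with the fused-type version compared to the function-pointer version. This suggests that the function-pointer calls are not being optimized for me (although I have not tried changing compiler settings to improve that).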

import numpy as np
import example

# show the two methods give the same answer
print(example.run_func_plus(np.ones((10,))))
print(example.run_func_minus(np.ones((10,))))

print(example.run_func_ptr_plus(np.ones((10,))))
print(example.run_func_ptr_minus(np.ones((10,))))

from timeit import timeit

# timing comparison
print(timeit("""run_func_plus(x)""",
             """from example import run_func_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

print(timeit("""run_func_ptr_plus(x)""",
             """from example import run_func_ptr_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

Answer 2

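It is true that using a function pointer may carry a small additional cost, but most of the time the performance hit comes from the compiler no longer being able to inline the called function and perform the optimizations that inlining makes possible.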

I would like to demonstrate this with the following example, which is somewhat smaller than yours:

int f(int i){
    return i;
}

int sum_with_fun(){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=f(i);
    }
    return sum;
}

typedef int(*fun_ptr)(int);
int sum_with_ptr(fun_ptr ptr){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=ptr(i);
    }
    return sum;
}

So there are two versions that calculate sum f(i) for i=0...999: one through a function pointer and one through a direct call.

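When compiled with -fno-inline (i.e. with inlining disabled to level the playing field), they produce almost identical assembler (on godbolt.org). The only slight difference is how the function is called: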

callq  4004d0 <_Z1fi>  //direct call
...
callq  *%r12           //via ptr

Performance-wise, this will not make much difference.

But when we drop -fno-inline, the compiler can shine for the direct version, which becomes (on godbolt.org)

_Z12sum_with_funv:
        movl    $499500, %eax
        ret

i.e. the whole loop is evaluated at compile time, whereas the indirect version is unchanged and still has to execute the loop at run time:

_Z12sum_with_ptrPFiiE:
        pushq   %r12
        movq    %rdi, %r12
        pushq   %rbp
        xorl    %ebp, %ebp
        pushq   %rbx
        xorl    %ebx, %ebx
.L5:
        movl    %ebx, %edi
        addl    $1, %ebx
        call    *%r12
        addl    %eax, %ebp
        cmpl    $1000, %ebx
        jne     .L5
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret
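
The constant 499500 in the direct version is the whole sum folded at compile time: 0 + 1 + ... + 999 = 999 * 1000 / 2 = 499500. A small C check of that arithmetic, where closed_form is a hypothetical helper added only for illustration:

```c
/* Same f and sum_with_ptr as in the example above. */
typedef int (*fun_ptr)(int);

int f(int i) {
    return i;
}

int sum_with_ptr(fun_ptr ptr) {
    int sum = 0;
    for (int i = 0; i < 1000; i++) {
        sum += ptr(i);
    }
    return sum;
}

/* Hypothetical helper: closed form of 0 + 1 + ... + (n - 1). */
int closed_form(int n) {
    return (n - 1) * n / 2;
}
```

Both sum_with_ptr(f) and closed_form(1000) evaluate to 499500, the value the compiler placed in %eax.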

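So where does this leave you? You could wrap the indirect function with known pointers, and chances are high that the compiler will be able to perform the above optimizations, for example: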

... 
int sum_with_f(){
    return sum_with_ptr(&f);
}

which results in (on godbolt.org):

_Z10sum_with_fv:
        movl    $499500, %eax
        ret

With the above approach you are at the mercy of the compiler (but modern compilers are merciful) when it comes to inlining.
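
Applied to the blend loop from the question, the wrapper idea might look like the following C sketch. It is only an illustration under assumptions: the images are flattened to 1-D arrays, multiply_pix is a hypothetical second blend mode, and whether the indirect call really gets inlined depends on the compiler and the optimization level.

```c
#include <stddef.h>

typedef double (*blend_type)(double, double);

/* Per-pixel blend modes. overlay_pix mirrors the question;
   multiply_pix is a hypothetical second mode for illustration. */
static inline double overlay_pix(double a, double b) {
    return a < 0.5 ? 2.0 * a * b : 1.0 - 2.0 * (1.0 - a) * (1.0 - b);
}

static inline double multiply_pix(double a, double b) {
    return a * b;
}

/* Generic loop written once. It still takes a function pointer. */
static inline void blend_it(const double *a, const double *b, double *out,
                            size_t n, blend_type blendfunc) {
    for (size_t i = 0; i < n; i++) {
        out[i] = blendfunc(a[i], b[i]);
    }
}

/* Thin wrappers: the pointer is a compile-time constant at each call
   site, so an optimizing compiler has the chance to inline blend_it
   and then the per-pixel function into the loop. */
void overlay(const double *a, const double *b, double *out, size_t n) {
    blend_it(a, b, out, n, overlay_pix);
}

void multiply(const double *a, const double *b, double *out, size_t n) {
    blend_it(a, b, out, n, multiply_pix);
}
```

The loop body exists only once in the source, and each wrapper is a single line, which keeps the convenience the function pointer was introduced for.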

There are also other options, depending on what you actually use:

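In C++ there are templates, which eliminate this kind of repetitive work without losing performance.
In C one would use macros to the same effect.
Numpy uses a preprocessor to generate repetitive code; see for example this src-file, from which a c-file is generated in a preprocessing step.
pandas uses an approach similar to numpy's for its Cython code; see for example the hashtable_func_helper.pxi.in-file.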
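
As a rough illustration of the macro option, the sketch below generates one specialized loop per blend mode in plain C. The DEFINE_BLEND macro and the blend_overlay and blend_multiply names are made up for this example, and the images are assumed to be flattened into 1-D arrays.

```c
#include <stddef.h>

/* Hypothetical macro: expands to a complete function whose loop body
   contains the per-pixel expression directly, so there is no function
   pointer left to call. Inside expr, the names a and b refer to the
   current base and blend values. */
#define DEFINE_BLEND(name, expr)                                       \
    void name(const double *a_arr, const double *b_arr, double *out,   \
              size_t n) {                                              \
        for (size_t i = 0; i < n; i++) {                               \
            const double a = a_arr[i];                                 \
            const double b = b_arr[i];                                 \
            out[i] = (expr);                                           \
        }                                                              \
    }

/* One line per blend mode; the loop itself is written only once. */
DEFINE_BLEND(blend_overlay,
             a < 0.5 ? 2.0 * a * b : 1.0 - 2.0 * (1.0 - a) * (1.0 - b))
DEFINE_BLEND(blend_multiply, a * b)
```

The trade-off is the usual one for macros: the generated code is as fast as a hand-written loop, but errors inside the macro body are harder to read and debug.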

Cython function pointer dereference time (compared to calling the function directly)
