Cython NumPy array manipulation slower than Python

Time: 2018-06-19 14:34:47

Tags: python arrays numpy optimization cython

I would like to optimize this Python code with Cython:

import numpy as np

def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
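
To make the update rule concrete, here is a small worked example. It is only a sketch: the input arrays are made up for illustration and are not from the question, and `__cluster_mean` is renamed `cluster_mean` purely for readability. Each new center is the mean of the cluster's points together with the old center, which is why a cluster of s points divides by s + 1.

```python
import numpy as np

def cluster_mean(point, center):
    # Mean over the cluster's points plus its current center (hence the + 1).
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)

def updated_centers(point, start, center):
    # Cluster c owns the rows point[start[c]:start[c + 1]].
    return np.array([cluster_mean(point[start[c]:start[c + 1]], center[c])
                     for c in range(center.shape[0])])

# Illustrative data (an assumption): 3 points in 2-D, split into 2 clusters.
point = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
start = np.array([0, 2, 3], dtype=np.intc)
center = np.array([[1.0, 1.0], [0.0, 0.0]])

result = updated_centers(point, start, center)
print(result)
```

Cluster 0 gives ([0, 0] + [2, 2] + [1, 1]) / 3 = [1, 1], and cluster 1 gives ([4, 4] + [0, 0]) / 2 = [2, 2].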

My Cython code:

cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.                                                                                        
DTYPE = np.intc

@cython.boundscheck(False)  # Deactivate bounds checking                                                                  
@cython.wraparound(False)   # Deactivate negative indexing.                                                               
@cython.cdivision(True)     # Deactivate division by 0 checking.                                                          
def updated_centers(double [:,:] point, int [:] label, double [:,:] center):
    if (point.shape[0] != label.size) or (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]):
        raise ValueError("Incompatible dimensions")

    cdef Py_ssize_t i, c, j
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]

    # Updated centers. We accumulate point and center contributions into this array.                                      
    # Start by adding the (unscaled) center contributions.                                                                
    new_center = np.zeros([nc, m])
    new_center[:] = center

    # Counter array. Will contain cluster sizes (including center, whose contribution                                     
    # is again added here) at the end of the point loop.                                                                  
    cluster_size = np.ones([nc], dtype=DTYPE)

    # Add point contributions.                                                                                            
    for i in range(n):
        c = label[i]
        cluster_size[c] += 1
        for j in range(m):
            new_center[c, j] += point[i, j]

    # Scale center+point summation to be a mean.                                                                          
    for c in range(nc):
        for j in range(m):
            new_center[c, j] /= cluster_size[c]

    return new_center

However, the Cython version is slower than the Python one:

Python: %timeit f.updated_centers(point, start, center)
331 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cython: %timeit fx.updated_centers(point, label, center)
433 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

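The annotated HTML output shows almost every line in yellow: the array allocations, the += and the /=. I expected Cython to be an order of magnitude faster. What am I doing wrong?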
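
Note that the Cython function takes a per-point `label` array, while the Python function takes `start` offsets. Assuming the points are stored sorted by cluster, the two describe the same clustering. The sketch below converts one into the other and mirrors the accumulate-then-scale loop in plain NumPy, so the two formulations can be compared on the same data. The helper names `labels_from_start` and `updated_centers_by_label` are hypothetical, and the sample arrays are made up.

```python
import numpy as np

def labels_from_start(start):
    # Expand offsets such as [0, 2, 3] into per-point labels [0, 0, 1].
    sizes = np.diff(start)
    return np.repeat(np.arange(len(sizes), dtype=np.intc), sizes)

def updated_centers_by_label(point, label, center):
    # Same scheme as the Cython loop: start from the center contributions,
    # add every point to its cluster's row, then divide by (size + 1).
    new_center = center.astype(float)      # astype returns a copy
    np.add.at(new_center, label, point)    # unbuffered "+=" for repeated labels
    cluster_size = np.bincount(label, minlength=center.shape[0]) + 1
    return new_center / cluster_size[:, None]

# Illustrative data (an assumption, not from the question).
point = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
start = np.array([0, 2, 3], dtype=np.intc)
center = np.array([[1.0, 1.0], [0.0, 0.0]])

label = labels_from_start(start)
by_label = updated_centers_by_label(point, label, center)

# Start-based reference, written as in the original Python code.
by_start = np.array([(point[start[c]:start[c + 1]].sum(axis=0) + center[c])
                     / (start[c + 1] - start[c] + 1)
                     for c in range(center.shape[0])])
print(np.allclose(by_label, by_start))
```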

3 answers:

Answer 0: (score: 0)

The key is to write Cython code that mirrors the Python code and touches the arrays only when necessary.

cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.
DTYPE = np.intc


@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)  # Deactivate negative indexing.
@cython.cdivision(True)  # Deactivate division by 0 checking.
def updated_centers(double [:, :] point, int [:] start, double [:, :] center):
    """Returns the updated list of cluster centers (damped center of mass Pahkira scheme). Cluster c
    (and center[c]) corresponds to the point range point[start[c]:start[c+1]]."""
    if (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]) or (start.size != center.shape[0] + 1):
        raise ValueError("Incompatible dimensions")

    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t i, c, j, cluster_start, cluster_stop, cluster_size
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    cdef double center_of_mass

    # Updated centers. Each entry is written exactly once, after its sum has
    # been accumulated in the C local center_of_mass.
    new_center = np.zeros([nc, m])

    cluster_start = start[0]
    for c in range(nc):
        cluster_stop = start[c + 1]
        # Cluster size including the center itself.
        cluster_size = cluster_stop - cluster_start + 1
        for j in range(m):
            # Start from the (unscaled) center contribution.
            center_of_mass = center[c, j]
            for i in range(cluster_start, cluster_stop):
                center_of_mass += point[i, j]
            new_center[c, j] = center_of_mass / cluster_size
        cluster_start = cluster_stop

    return np.asarray(new_center)

We use the same API as the Python version:

n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.intc); center=np.random.rand(k, m);

%timeit fx.updated_centers(point, start, center)
31 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit f.updated_centers(point, start, center)
734 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
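
When comparing timings like these, it helps to confirm that the compiled function returns the same values as the Python one. The sketch below is one possible NumPy-only baseline for that check, not the method used in this answer: it swaps the explicit loops for `np.add.reduceat`. It assumes `start` is strictly increasing (no empty clusters) and that `start[-1] == len(point)`; the function name `updated_centers_reduceat` is hypothetical, and the problem size is reduced from the benchmark above.

```python
import numpy as np

def updated_centers_reduceat(point, start, center):
    # Segment sums over point[start[c]:start[c + 1]] for every cluster c.
    # Assumption: no empty clusters, and the last segment runs to the end.
    sums = np.add.reduceat(point, start[:-1], axis=0)
    sizes = np.diff(start)
    return (sums + center) / (sizes + 1)[:, None]

# Same shape of test data as the benchmark, at a smaller size.
rng = np.random.default_rng(0)
n, m = 1000, 5
k = n // 2
point = rng.random((n, m))
start = 2 * np.arange(k + 1, dtype=np.intc)
center = rng.random((k, m))

# Loop-based reference, equivalent to the original Python code.
expected = np.array([(point[start[c]:start[c + 1]].sum(axis=0) + center[c])
                     / (start[c + 1] - start[c] + 1) for c in range(k)])
result = updated_centers_reduceat(point, start, center)
print(np.allclose(result, expected))
```

The same `np.allclose` comparison can be applied to the output of the compiled `fx.updated_centers`.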

Answer 1: (score: 0)

You need to tell Cython that new_center and cluster_size are arrays:

cdef double[:, :] new_center = np.zeros((nc, m))
...
cdef int[:] cluster_size = np.ones((nc,), dtype=DTYPE)
...

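Without these type annotations, Cython cannot generate efficient C code and has to call into the Python interpreter every time these arrays are accessed. That is why the lines that access these arrays are yellow in the HTML output of cython -a.

With just these two small changes, we immediately see the speedup we wanted: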

%timeit python_updated_centers(point, start, center)
392 ms ± 41.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit cython_updated_centers(point, start, center)
1.18 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 2: (score: 0)

For such a simple kernel, you can also get a decent speedup with pythran:

#pythran export updated_centers(float64 [:, :], int32 [:] , float64 [:, :] )
import numpy as np
def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c]) for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)

Compile it with pythran updated_centers.py and you get the following timings:

NumPy code (the same code, not compiled):

$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 271 ms +- 12 ms

Pythran (after compilation):

$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 12.8 ms +- 0.3 ms