Cython parallel loop problems

Date: 2016-11-06 15:43:48

Tags: python performance openmp cython

I am using Cython to compute a pairwise distance matrix with a custom metric, as a faster alternative to scipy.spatial.distance.pdist.

My motivation

My metric has the form

def mymetric(u, v, w):
    return np.sum(w * (1 - np.abs(np.abs(u - v) / np.pi - 1))**2)

and the pairwise distances can be computed with scipy as

x = sp.spatial.distance.pdist(r, metric=lambda u, v: mymetric(u, v, w))

Here, r is an m-by-n matrix of m vectors with dimension n, and w is a "weight" factor with dimension n.

Since m is rather large in my problem, the computation is very slow. For m = 2000 and n = 10 it takes about 20 seconds.
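(For reference, a minimal self-contained version of this baseline looks as follows; the random input is only an illustration mirroring the shapes m = 2000, n = 10 mentioned above.)

import numpy as np
import scipy.spatial.distance

def mymetric(u, v, w):
    return np.sum(w * (1 - np.abs(np.abs(u - v) / np.pi - 1))**2)

m, n = 2000, 10
r = np.random.uniform(0, 2 * np.pi, size=(m, n))  # m vectors of dimension n
w = np.random.uniform(0, 1, size=n)               # "weight" factor of dimension n

# one Python-level metric call per pair of rows -> roughly 20 s for this size
x = scipy.spatial.distance.pdist(r, metric=lambda u, v: mymetric(u, v, w))
print(x.shape)  # (m * (m - 1) // 2,), the condensed distance vector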

Initial solution in Cython

I implemented a simple function in Cython to compute the pairwise distances and immediately got a very promising result: a speedup of more than 500x.

import numpy as np
cimport numpy as np
import cython

from libc.math cimport fabs, M_PI

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    size = r.shape[0] * (r.shape[0] - 1) / 2  # number of pairs = length of the condensed distance vector
    ans = np.zeros(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i + 1, r.shape[0]):
            c += 1
            for k in range(r.shape[1]):
                ans[c] += w[k] * (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))**2.0

    return ans
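(To check that the Cython function agrees with the scipy baseline, assuming the module compiles under the name pdist, one can compare the two on small random input:)

import numpy as np
import scipy.spatial.distance

import pdist  # assumed name of the compiled Cython module

r = np.random.uniform(0, 2 * np.pi, size=(50, 10))
w = np.random.uniform(0, 1, size=10)

reference = scipy.spatial.distance.pdist(
    r, metric=lambda u, v: np.sum(w * (1 - np.abs(np.abs(u - v) / np.pi - 1))**2))
fast = pdist.pairwise_distance(r, w)

# both return the condensed distance vector in the same pair order
assert np.allclose(reference, fast)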

Problem with OpenMP

I wanted to speed the computation up further using OpenMP; however, the following solution is roughly 3 times slower than the serial version.

import numpy as np
cimport numpy as np

import cython
from cython.parallel import prange, parallel

cimport openmp

from libc.math cimport fabs, M_PI

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_omp(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size, m, n
    cdef np.double_t a
    cdef np.ndarray[np.double_t, ndim=1] ans
    m = r.shape[0]
    n = r.shape[1]
    size = m * (m - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)
    with nogil, parallel(num_threads=8):
        for i in prange(m, schedule='dynamic'):
            for j in range(i + 1, m):
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1  # position of pair (i, j) in the condensed distance vector
                for k in range(n):
                    ans[c] += w[k] * (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))**2.0

    return ans
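(As a quick sanity check of the closed-form index above, the expression c = i * (m - 1) - i * (i + 1) / 2 + j - 1 can be compared against the running counter from the serial version in plain Python:)

m = 7  # any small number of rows
c = -1
for i in range(m):
    for j in range(i + 1, m):
        c += 1
        # closed-form position of pair (i, j) in the condensed distance vector
        assert c == i * (m - 1) - i * (i + 1) // 2 + j - 1
print("index formula matches the sequential counter")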

I have no idea why it is actually slower, but I tried to introduce the following change. This not only performs slightly worse, but the resulting distances in ans are computed correctly only at the beginning of the array; the rest are just zeros. The speedup achieved by this version is negligible.

import numpy as np
cimport numpy as np

import cython
from cython.parallel import prange, parallel

cimport openmp

from libc.math cimport fabs, M_PI
from libc.stdlib cimport malloc, free

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_omp_2(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int k, l, c, m, n
    cdef Py_ssize_t i, j, d
    cdef size_t size
    cdef int *ci, *cj

    cdef np.ndarray[np.double_t, ndim=1, mode="c"] ans

    cdef np.ndarray[np.double_t, ndim=2, mode="c"] data
    cdef np.ndarray[np.double_t, ndim=1, mode="c"] weight

    data = np.ascontiguousarray(r, dtype=np.float64)
    weight = np.ascontiguousarray(w, dtype=np.float64)

    m = r.shape[0]
    n = r.shape[1]
    size = m * (m - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)

    cj = <int*> malloc(size * sizeof(int))
    ci = <int*> malloc(size * sizeof(int))

    c = -1
    for i in range(m):
        for j in range(i + 1, m):
            c += 1
            ci[c] = i
            cj[c] = j

    with nogil, parallel(num_threads=8):
        for d in prange(size, schedule='guided'):
            for k in range(n):
                ans[d] += weight[k] * (1.0 - fabs(fabs(data[ci[d], k] - data[cj[d], k]) / M_PI - 1.0))**2.0

    free(ci)  # release the index arrays allocated with malloc above
    free(cj)

    return ans
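(A side note, unrelated to the slowdown: the ci/cj pair indices built here with malloc can also be obtained directly from NumPy, in the same order as the condensed distance vector; a small sketch:)

import numpy as np

m = 5  # number of rows of r, small here just for illustration
ci, cj = np.triu_indices(m, k=1)  # all pairs (i, j) with j > i, in condensed order
print(list(zip(ci.tolist(), cj.tolist())))
# [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]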

For all of the functions I am using the following .pyxbld file:

def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_compile_args=['-O3', '-march=native', '-ffast-math', '-fopenmp'],
                     extra_link_args=['-fopenmp'],
                     )
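(The .pyxbld file is picked up by pyximport; for completeness, the modules above are then compiled on import roughly like this, with pdist.pyx / pdist.pyxbld as assumed file names.)

import numpy as np
import pyximport

# compile .pyx files on import; the matching .pyxbld supplies the OpenMP flags
pyximport.install(setup_args={'include_dirs': np.get_include()})

import pdist  # builds pdist.pyx using pdist.pyxbld, then imports it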

Summary

I have no experience with Cython and know only the basics of C. I would appreciate any suggestion on what may be causing this unexpected behavior, or even on how to reformulate my problem in a better way.

Best serial solution (10% faster than the original serial version)

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_2(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    cdef np.double_t accumulator, tmp
    size = r.shape[0] * (r.shape[0] - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i + 1, r.shape[0]):
            c += 1
            accumulator = 0
            for k in range(r.shape[1]):
                tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                accumulator += w[k] * (tmp*tmp)
            ans[c] = accumulator

    return ans
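(For anyone reproducing the 10% figure above: a minimal timing harness, with an assumed module name pdist and random inputs, could look like this.)

import timeit

import numpy as np

import pdist  # assumed name of the compiled Cython module

r = np.random.uniform(0, 2 * np.pi, size=(2000, 10))
w = np.random.uniform(0, 1, size=10)

for name in ('pairwise_distance', 'pairwise_distance_2'):
    func = getattr(pdist, name)
    t = timeit.timeit(lambda: func(r, w), number=10)
    print(name, t / 10, 's per call')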

Best parallel solution (1% faster than the original parallel version, 6x faster than the best serial version when using 8 threads)

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_omp_2d(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size, m, n
    cdef np.ndarray[np.double_t, ndim=1] ans
    cdef np.double_t accumulator, tmp
    m = r.shape[0]
    n = r.shape[1]
    size = m * (m - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)
    with nogil, parallel(num_threads=8):
        for i in prange(m, schedule='dynamic'):
            for j in range(i + 1, m):
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1
                accumulator = 0  # a local accumulator cannot be used here yet, see "Unresolved problem" below
                for k in range(n):
                    tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                    ans[c] += w[k] * (tmp*tmp)

    return ans
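(A side remark of my own, not from the original post: the amount of work per outer iteration shrinks as i grows, since row i is only compared against the m - i - 1 rows after it, which is why schedule='dynamic' rather than a plain contiguous static split tends to matter here. A few lines of Python make the imbalance of a static split explicit:)

m, threads = 2000, 8
work = [m - i - 1 for i in range(m)]  # inner-loop length for each outer index i

chunk = m // threads  # contiguous static split of the outer loop
per_thread = [sum(work[t * chunk:(t + 1) * chunk]) for t in range(threads)]

print(per_thread)            # the first chunks carry far more pairs than the last ones
print(sum(work) // threads)  # the even share a dynamic/guided schedule aims for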

Unresolved problem:

When I try to apply the accumulator solution proposed in the answer, I get the following error:

Error compiling Cython file:
------------------------------------------------------------
...
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1
                accumulator = 0
                for k in range(n):
                    tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                    accumulator += w[k] * (tmp*tmp)
                ans[c] = accumulator
                                   ^
------------------------------------------------------------
pdist.pyx:207:36: Cannot read reduction variable in loop body

Full code:

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_omp(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size, m, n
    cdef np.ndarray[np.double_t, ndim=1] ans
    cdef np.double_t accumulator, tmp
    m = r.shape[0]
    n = r.shape[1]
    size = m * (m - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)
    with nogil, parallel(num_threads=8):
        for i in prange(m, schedule='dynamic'):
            for j in range(i + 1, m):
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1
                accumulator = 0
                for k in range(n):
                    tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                    accumulator += w[k] * (tmp*tmp)
                ans[c] = accumulator

    return ans
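(Note added for readers hitting the same error: inside a prange body, Cython treats a variable that is updated with an in-place operator such as += as a reduction variable, which is why the later read in ans[c] = accumulator is rejected. The answer's own snippet below writes the update as a plain assignment, accumulator = accumulator + ..., which should keep accumulator an ordinary thread-private variable. A sketch with only that change, untested here, using the same imports and an assumed name pairwise_distance_omp_acc:)

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance_omp_acc(np.ndarray[np.double_t, ndim=2] r, np.ndarray[np.double_t, ndim=1] w):
    cdef int i, j, k, c, size, m, n
    cdef np.ndarray[np.double_t, ndim=1] ans
    cdef np.double_t accumulator, tmp
    m = r.shape[0]
    n = r.shape[1]
    size = m * (m - 1) / 2
    ans = np.zeros(size, dtype=r.dtype)
    with nogil, parallel(num_threads=8):
        for i in prange(m, schedule='dynamic'):
            for j in range(i + 1, m):
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1
                accumulator = 0
                for k in range(n):
                    tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                    # plain assignment, not "+=", so accumulator is not inferred
                    # as a reduction variable and can still be read below
                    accumulator = accumulator + w[k] * (tmp*tmp)
                ans[c] = accumulator

    return ans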

1 Answer:

Answer 0 (score: 3)

I haven't timed this myself, so it may not help too much, but:

If you run cython -a to get an annotated version of your initial attempt (pairwise_distance_omp), you will find that the line

ans[c] += ...

is yellow, suggesting it carries Python overhead. A look at the C code corresponding to that line shows that it is checking for division by zero. One key part of it starts:

if (unlikely(M_PI == 0)) {

You know this can never be true (and in any case you would probably rather get NaN values than an exception if it were). You can avoid the check by adding the following extra decorator to the function:

@cython.cdivision(True)
# other decorators
def pairwise_distance_omp # etc...

This cuts out quite a bit of C code, including bits that have to be run in a single thread. On the other hand, most of that code should never be run anyway, and the compiler should probably be able to work that out, so it isn't clear how much difference it will make.

Second suggestion:

# at the top
cdef np.double_t accumulator, tmp

# further down, inside the loop:
                c = i * (m - 1) - i * (i + 1) / 2 + j - 1
                accumulator = 0
                for k in range(r.shape[1]):
                    tmp = (1.0 - fabs(fabs(r[i, k] - r[j, k]) / M_PI - 1.0))
                    accumulator = accumulator + w[k] * (tmp*tmp)
                ans[c] = accumulator

This has two advantages: 1) tmp*tmp should be quicker than a floating-point exponentiation to the power of 2, and 2) you avoid reading from the ans array, which may be a bit slow because the compiler always has to be careful that some other thread has not changed it (even though you know it should not have).
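(For reference, the annotated view mentioned at the start of this answer can also be produced from Python rather than the cython -a command line; the file name pdist.pyx is an assumption.)

from Cython.Build import cythonize

# equivalent to running "cython -a pdist.pyx"; writes pdist.html next to the .pyx,
# where yellow lines mark interaction with the Python C-API
cythonize("pdist.pyx", annotate=True)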