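How to use parallelism in Cython

Time: 2018-06-18 15:27:41

Tags: python c++ parallel-processing openmp cython

I am trying to parallelize the following algorithm. This should be easy, because the computation is independent across the first three dimensions (b, i, j).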

def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
    stride0 = stride[0], stride1 = stride[1], \
    num_dim = kernels.shape[0], \
    num_filters = kernels.shape[3], \
    batch_size = inputv.shape[0]

    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
    out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1

    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)

    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i*stride0
                    jin = j*stride1

                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out

Error:
Cannot read reduction variable in loop body
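
For reference, the loop nest above is a strided convolution without padding, so each output size is (input size - kernel size) / stride + 1 using integer division. The following is only a minimal plain-Python sketch of the same arithmetic, meant for checking a compiled version on small inputs. The function name conv_forward_reference and the nested-list layout are assumptions made for illustration, not part of the original code.

```python
def conv_forward_reference(inputv, kernels, bias, stride):
    """Plain-Python version of the loop nest, for checking small cases.

    inputv:  nested lists indexed [batch][row][col][channel]
    kernels: nested lists indexed [channel][kw][kh][filter]
    bias:    list indexed [filter]
    """
    num_dim = len(kernels)
    kernel_size0 = len(kernels[0])
    kernel_size1 = len(kernels[0][0])
    num_filters = len(kernels[0][0][0])
    stride0, stride1 = stride

    # Integer division: number of window positions that fit in the input.
    out_size0 = (len(inputv[0]) - kernel_size0) // stride0 + 1
    out_size1 = (len(inputv[0][0]) - kernel_size1) // stride1 + 1

    out = []
    for b in range(len(inputv)):  # each b is independent of the others
        plane = []
        for i in range(out_size0):
            row = []
            for j in range(out_size1):
                iin, jin = i * stride0, j * stride1
                cell = []
                for n in range(num_filters):
                    acc = 0.0  # local to this (b, i, j, n) cell
                    for kw in range(kernel_size0):
                        for kh in range(kernel_size1):
                            for m in range(num_dim):
                                acc += (inputv[b][iin + kw][jin + kh][m]
                                        * kernels[m][kw][kh][n])
                    cell.append(acc + bias[n])
                row.append(cell)
            plane.append(row)
        out.append(plane)
    return out


# Tiny example: one 3x3 single-channel image, one 2x2 all-ones filter, stride 1.
image = [[[[1.0], [2.0], [3.0]],
          [[4.0], [5.0], [6.0]],
          [[7.0], [8.0], [9.0]]]]
ones_kernel = [[[[1.0], [1.0]],
                [[1.0], [1.0]]]]
result = conv_forward_reference(image, ones_kernel, [0.5], (1, 1))
print(result)  # [[[[12.5], [16.5]], [[24.5], [28.5]]]]
```

Each output cell depends only on its own (b, i, j, n) indices, which is why the outer loops are candidates for parallelization.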

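Initially I tried to parallelize only at the b level, because parallelizing over b, i, j would work at the pixel level, and I do not know whether it is worth spawning that many threads. But I did not succeed.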

I tried using a temporary array out_batch, but because it is a NumPy array it gave me a lot of problems, including:

Error: malloc problems

I also tried using double[:, :, :] arrays instead of NumPy double arrays, but that gives:

Error: Memoryview slices can only be shared in parallel sections

Does anyone have an idea? Is there a way to apply nogil at the b, i, j level (or only at the b level) and then gather the data?

1 answer:

Answer 0: (score: 1)

Clearly, the variable acc is shared between all threads, which can lead to a race condition, so Cython rightly refuses to compile this code.

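The variable acc should not be shared between threads; it should be private to each thread. However, to my limited knowledge, there is no way to do this in Cython yet (I am not sure what happened to this proposal).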

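The usual workaround is to allocate a sufficiently large work array tmp and accumulate the value for the i-th thread in tmp[i]. Often enough (but not always), an array that already exists can serve this purpose. That is the case here: replace acc with out[b, i, j, n]: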

for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
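
The work-array idea mentioned above is not specific to Cython. As a rough illustration only, here is a plain-Python sketch in which each worker accumulates into its own slot tmp[w] and the slots are combined at the end. The names parallel_sum_with_slots and num_workers are hypothetical, and Python threads are used purely to show the data layout; they are not expected to give the speedup that prange under nogil is meant to provide.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_with_slots(values, num_workers=4):
    # One slot per worker: worker w only ever touches tmp[w],
    # so no two workers write to the same location.
    tmp = [0.0] * num_workers

    def work(w):
        # Worker w handles every num_workers-th element, starting at index w.
        for x in values[w::num_workers]:
            tmp[w] += x

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list(...) waits for all workers and re-raises any worker exception.
        list(pool.map(work, range(num_workers)))

    # Combine the per-worker partial results once all workers are done.
    return sum(tmp)


total = parallel_sum_with_slots([float(k) for k in range(1, 101)])
print(total)  # 5050.0
```

In the convolution above, out[b, i, j, n] plays the role of tmp[w]: every (b, i, j, n) cell is written by exactly one iteration of the parallel loop over b, so no separate work array is needed.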