I am trying to parallelize the following algorithm. It should be easy to parallelize, since the computation is independent over the first three dimensions (b, i, j).
def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
        stride0 = stride[0], stride1 = stride[1], \
        num_dim = kernels.shape[0], \
        num_filters = kernels.shape[3], \
        batch_size = inputv.shape[0]
    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
        out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1
    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)
    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i*stride0
                    jin = j*stride1
                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out
Error:
Cannot read reduction variable in loop body
Initially I tried to parallelize only at the b level, because i, j are at the pixel level and I don't know whether it is worth spawning that many threads, but I had no success.
I tried using a temporary array out_batch, but being a numpy array it gave me a lot of problems and
Error: malloc problems
I also tried using numpy arrays (double arrays) instead of double[:,:,:] arrays, but it gives:
Error: Memoryview slices can only be shared in parallel sections
Does anyone have an idea? Is there a way to apply nogil at the b, i, j (or only b) level and then compress the data?
Answer (score: 1)
Obviously, the variable acc is shared between all threads, so it can lead to race conditions, and Cython is right not to let this code compile.
The variable acc should not be shared between the threads, but be private to each thread. However, to my limited knowledge there is not yet a way to do this with Cython (not sure what happened to this proposal).
The usual workaround is to allocate a working array tmp that is large enough and to accumulate the value of the i-th thread in tmp[i]. Often enough (but not always) an array that is already present can be used for this purpose, so in your case, replace acc with out[b, i, j, n]:
for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
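For the more general pattern described above (a scratch slot per thread rather than reusing the output array), a minimal sketch could look like the following. This assumes the OpenMP backend is enabled at compile time; the function name parallel_sum and the arrays values and tmp are illustrative and do not come from the question.

cimport cython
cimport openmp
from cython.parallel import prange, threadid
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum(double[:] values):
    cdef Py_ssize_t i
    cdef int tid
    cdef int num_threads = openmp.omp_get_max_threads()
    # one scratch slot per thread instead of a single shared accumulator
    cdef double[:] tmp = np.zeros(num_threads, dtype=np.float64)
    for i in prange(values.shape[0], nogil=True):
        tid = threadid()
        tmp[tid] += values[i]  # each thread only ever writes its own slot
    # combine the per-thread partial sums sequentially afterwards
    return np.asarray(tmp).sum()

The same idea carries over to the convolution above: size tmp with omp_get_max_threads(), index it with threadid() inside prange, and do the final reduction after the parallel loop has finished.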