PyCUDA array indexing with threads & blocks

Date: 2016-04-30 22:46:36

Tags: python arrays cuda pycuda

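I am trying to write a CUDA histogram function for use with PyCUDA. The code seems to iterate over more elements than the size of the array I pass in. To rule out errors in the bin calculation, I created a very simple kernel: I pass it a 2D array and add 1 to the first bucket of the histogram for every element processed. I consistently get more elements than there are in my 2D array.

The output should be [size_of_2d_array, 0, 0, 0].

I am running on Ubuntu 15.04 with Python 2.7.9. When I try examples written by other people, they seem to work fine.

What am I doing wrong?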

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

#Make the kernel
histogram_kernel = """
__global__ void kernel_getHist(unsigned int* array,unsigned int size, unsigned int lda, unsigned int* histo, float buckets)
{

    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int tid = y + lda * x;


    if(tid<size){
        //unsigned int value = array[tid];

        //int bin = floor(value * buckets);

        atomicAdd(&histo[0],1);
    }
}
"""
mod = SourceModule(histogram_kernel)


#2d array to analyze
a = np.ndarray(shape=(2,2))
a[0,0] = 1
a[0,1] =2 
a[1,0] = 3
a[1,1] = 4


#histogram stuff, passing but not using right now
max_val = 4
num_bins = np.uint32(4)
bin_size = 1 / np.uint32(max_val / num_bins)

#send array to the gpu
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

#send the histogram to the gpu
a_hist = np.ndarray([1,num_bins])
a_hist = a_hist.astype(np.uint32)
a_hist = a_hist * 0
d_hist = cuda.mem_alloc(a_hist.nbytes)
cuda.memcpy_htod(d_hist, a_hist)

#get the function
func = mod.get_function('kernel_getHist')

#get size & lda
a_size = np.uint32(a.size)
a_lda = np.uint32(a.shape[0])

#print size & lda to check
print(a_lda)
print(a_size)

#run function
func(a_gpu, a_size, a_lda,  d_hist, bin_size, block=(16,16,1))

#get histogram back
cuda.memcpy_dtoh(a_hist, d_hist)

#print the histogram
print a_hist
print a

This code outputs the following:

2
4
[[6 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

However, it should output:

2
4
[[4 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

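The histogram has too many elements, which leads me to believe I am doing something wrong with tid and size.

Any ideas?

1 answer:

Answer 0 (score: 0)

The problem here is that you are not computing unique values of tid within the kernel. If you do some simple arithmetic on a piece of paper, you should get this for blockDim.x = blockDim.y = 16 and lda = 2: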

x   y   tid 
0   0   0
1   0   2
0   1   1
1   1   3
0   2   2
0   3   3
..  ..  ..

Note that the last two are duplicate indices. This is why your code returns 6: there are 6 threads which satisfy tid < size for size = 4.
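
The arithmetic above can also be checked on the host, without a GPU. What follows is only a minimal sketch in plain Python: it enumerates the threads of one 16x16 block and applies the kernel's own index formula. The helper name passing_threads is made up for illustration and is not part of PyCUDA.

```python
def passing_threads(block_dim=(16, 16), lda=2, size=4):
    """Return (x, y, tid) for every thread of one block with tid < size.

    Mirrors the kernel's formula tid = y + lda * x for a single block,
    so blockIdx is 0 and x, y are simply threadIdx.x and threadIdx.y.
    """
    hits = []
    for x in range(block_dim[0]):
        for y in range(block_dim[1]):
            tid = y + lda * x
            if tid < size:
                hits.append((x, y, tid))
    return hits


hits = passing_threads()
for x, y, tid in hits:
    print("x=%d y=%d tid=%d" % (x, y, tid))
print("threads passing the test: %d" % len(hits))
```

For the launch in the question this reports 6 threads, with tid 2 and tid 3 each appearing twice, which matches the 6 in the first histogram bucket.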

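You have two options for fixing this. One option is to compute a unique index correctly, for example:

unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int gda = blockDim.y * gridDim.y;
unsigned int tid =  y + gda * x;

should work. Alternatively, apply a bounds check on each dimension of the input array, so that only threads whose x and y both lie inside the array update the histogram. That would probably also work.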
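
Both suggestions can be sanity-checked on the host before touching the kernel. This is only a sketch in plain Python, not kernel code: it counts how many threads of the question's single 16x16 block would reach the atomicAdd under each scheme. The function names, and the parameters rows and cols, are assumptions made for illustration; rows and cols stand for extra kernel arguments holding the array dimensions, which the original kernel does not have. In the kernel the corresponding test would read something like if (x < rows && y < cols), with the mapping of rows and cols onto x and y depending on the array layout.

```python
def count_unique_index(block_dim=(16, 16), grid_dim=(1, 1), size=4):
    """Count threads with tid < size when tid = y + gda * x."""
    gda = block_dim[1] * grid_dim[1]  # total number of threads along y
    count = 0
    for x in range(block_dim[0] * grid_dim[0]):
        for y in range(gda):
            if y + gda * x < size:
                count += 1
    return count


def count_bounds_checked(block_dim=(16, 16), grid_dim=(1, 1), rows=2, cols=2):
    """Count threads that pass a per-dimension bounds check.

    rows and cols are hypothetical extra kernel arguments (both 2 here).
    """
    count = 0
    for x in range(block_dim[0] * grid_dim[0]):
        for y in range(block_dim[1] * grid_dim[1]):
            if x < rows and y < cols:
                count += 1
    return count


print("unique index: %d" % count_unique_index())
print("bounds check: %d" % count_bounds_checked())
```

For the 2x2 input both counts come out as 4, which is the value expected in the first histogram bucket.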