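I am trying to write a CUDA histogram function for use with PyCUDA. The code seems to iterate over more elements than the size of the array I pass in. To rule out errors in the bin calculation, I created a very simple kernel: I pass it a 2D array and it adds 1 to the first bucket of the histogram for every element it processes. I keep getting more elements than there are in my 2D array.
The output should be [size_of_2d_array, 0, 0, 0].
I am running on Ubuntu 15.04 with Python 2.7.9. When I try examples written by other people, they seem to work fine.
What am I doing wrong?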
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
#Make the kernel
histogram_kernel = """
__global__ void kernel_getHist(unsigned int* array,unsigned int size, unsigned int lda, unsigned int* histo, float buckets)
{
unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int tid = y + lda * x;
if(tid<size){
//unsigned int value = array[tid];
//int bin = floor(value * buckets);
atomicAdd(&histo[0],1);
}
}
"""
mod = SourceModule(histogram_kernel)
#2d array to analyze
a = np.ndarray(shape=(2,2))
a[0,0] = 1
a[0,1] =2
a[1,0] = 3
a[1,1] = 4
#histogram stuff, passing but not using right now
max_val = 4
num_bins = np.uint32(4)
bin_size = 1 / np.uint32(max_val / num_bins)
#send array to the gpu
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
#send the histogram to the gpu
a_hist = np.ndarray([1,num_bins])
a_hist = a_hist.astype(np.uint32)
a_hist = a_hist * 0
d_hist = cuda.mem_alloc(a_hist.nbytes)
cuda.memcpy_htod(d_hist, a_hist)
#get the function
func = mod.get_function('kernel_getHist')
#get size & lda
a_size = np.uint32(a.size)
a_lda = np.uint32(a.shape[0])
#print size & lda to check
print(a_lda)
print(a_size)
#run function
func(a_gpu, a_size, a_lda, d_hist, bin_size, block=(16,16,1))
#get histogram back
cuda.memcpy_dtoh(a_hist, d_hist)
#print the histogram
print a_hist
print a
This code outputs the following:
2
4
[[6 0 0 0]]
[[ 1. 2.]
[ 3. 4.]]
However, it should output:
2
4
[[4 0 0 0]]
[[ 1. 2.]
[ 3. 4.]]
The histogram counts too many elements, which leads me to believe I am doing something wrong with tid and size.
Any ideas?
Answer 0 (score: 0)
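The problem here is that you are not computing a unique value of tid in the kernel. If you do some simple arithmetic on a piece of paper, you should get this for blockDim.x = blockDim.y = 16 and lda = 2: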
x y tid
0 0 0
1 0 2
0 1 1
1 1 3
0 2 2
0 3 3
.. .. ..
Note that the last two are duplicate indices. This is why your code returns 6: there are 6 threads that satisfy tid < size for size=4.
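The count of 6 can be checked on the host without a GPU. The following is a minimal sketch that emulates the single 16x16 block from the question in plain Python and applies the same index formula as the kernel; the helper name `passing_threads` is hypothetical and not part of PyCUDA.

```python
# Emulate one 16x16 thread block (block=(16,16,1) in the question) on the host.
BLOCK_X, BLOCK_Y = 16, 16
lda, size = 2, 4  # values printed by the question's script


def passing_threads(block_x, block_y, lda, size):
    """Return (x, y, tid) for every thread that passes the `tid < size` test."""
    hits = []
    for x in range(block_x):
        for y in range(block_y):
            tid = y + lda * x  # same formula as in kernel_getHist
            if tid < size:
                hits.append((x, y, tid))
    return hits


hits = passing_threads(BLOCK_X, BLOCK_Y, lda, size)
count = len(hits)                      # number of atomicAdd calls the kernel makes
tids = sorted(t for _, _, t in hits)   # indices 2 and 3 each appear twice
print(count)
print(tids)
```

Each thread that passes the test performs one atomicAdd, so the number of passing threads is the value that ends up in histo[0].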
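You have two options to fix this. One option is to compute a unique index correctly, for example:
unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
unsigned int gda = blockDim.y * gridDim.y;
unsigned int tid = y + gda * x;
should work. Alternatively, applying bounds on each dimension of the input array would probably also work.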
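Both options can be sanity-checked with the same kind of host-side emulation. This is only a sketch under stated assumptions: it assumes a single block (no grid is passed in the question's launch), and the bound `ncols = size // lda` in the second function is a hypothetical choice for the per-dimension limits, since the answer does not spell them out.

```python
BLOCK = (16, 16)  # block=(16,16,1) from the question
GRID = (1, 1)     # assumption: a single block, since the launch passes no grid
lda, size = 2, 4


def tids_unique_index(block, grid, size):
    """Option 1: tid = y + gda * x, with gda = blockDim.y * gridDim.y."""
    gda = block[1] * grid[1]
    all_tids = [y + gda * x
                for x in range(block[0] * grid[0])
                for y in range(block[1] * grid[1])]
    passing = [t for t in all_tids if t < size]
    return all_tids, passing


def tids_bounded(block, grid, lda, size):
    """Option 2: keep tid = y + lda * x, but bound x and y separately."""
    ncols = size // lda  # hypothetical bound on x; y is bounded by lda
    passing = []
    for x in range(block[0] * grid[0]):
        for y in range(block[1] * grid[1]):
            tid = y + lda * x
            if y < lda and x < ncols and tid < size:
                passing.append(tid)
    return passing


all_tids, passing_1 = tids_unique_index(BLOCK, GRID, size)
passing_2 = tids_bounded(BLOCK, GRID, lda, size)
print(len(passing_1), len(passing_2))
```

In this emulation both variants let exactly four threads through. Note that the first variant makes tid unique per thread but no longer ties it to the array layout, which matters once the commented-out array[tid] read is re-enabled; the second variant keeps the original layout.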