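I am trying to use shared memory to speed up a histogram computation, but I have run into a problem I cannot figure out. Below is the kernel that is giving me trouble. I am sure I am missing something silly, but I cannot find it.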
__global__
void histogram_kernel_shared(const unsigned int* const d_vals,
                             unsigned int* d_histo,
                             const unsigned int numElems) {
    unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int lid = threadIdx.x;
    unsigned int bin = d_vals[gid];
    __syncthreads();
    __shared__ unsigned int local_bin[1024];
    local_bin[lid] = d_histo[lid];
    __syncthreads();
    if(local_bin[lid] != d_histo[lid])
        printf("After copy to local. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);
    __syncthreads();
    // If I comment out this line everything works fine.
    d_histo[lid] = local_bin[lid];
    // Even this leads to some wrong answers. Printouts on the next printf.
    // d_histo[lid] = d_histo[lid];
    __syncthreads();
    if(local_bin[lid] != d_histo[lid])
        printf("copy back. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);
    __syncthreads();
    atomicAdd(&d_histo[bin], static_cast<unsigned int>(1));
    __syncthreads();
    // atomicAdd(&local_bin[bin], static_cast<unsigned int>(1));
    __syncthreads();
}
The kernel is launched as follows:
threads = 1024;
blocks = numElems/threads;
histogram_kernel_shared<<<blocks, threads>>>(d_vals, d_histo, numElems);
The number of elements is 10,240,000 and the number of bins is 1024.
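What bothers me is why the assignment d_histo[lid] = local_bin[lid]; makes a difference here. Without it, the code runs correctly. But that assignment should not change anything, since I have just copied the values with local_bin[lid] = d_histo[lid];. And why does local_bin[lid] = d_histo[lid]; also give garbage values?
My guess is that there is some weird UB somewhere else, but where?
Thanks for your help.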
Answer 0 (score: 3)
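You are launching 10,000 blocks:
blocks = numElems/threads;
and every block writes to the first 1024 locations of d_histo (indexed by lid):
d_histo[lid] = local_bin[lid];
Since all 10,000 blocks write to the same locations, they step on and overwrite each other. __syncthreads() is only a barrier for the threads within one block; it does not order blocks against each other. A block can therefore write back a stale copy of a bin after another block has already incremented that bin with atomicAdd, and the increment is lost. Because the order in which blocks execute is undefined, the resulting values are undefined as well.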
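One common way to avoid this race while still using shared memory is to give each block a private histogram that starts at zero and to merge it into d_histo only through atomicAdd. The sketch below is not code from the question or the answer. The names histogram_kernel_private, run_histogram and NUM_BINS are made up for illustration. It assumes every value in d_vals is less than 1024, that d_histo is zeroed by the host before the launch, and that there is at least one element. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int NUM_BINS = 1024;  // bin count from the question

// Each block accumulates into its own shared-memory histogram and then
// merges it into the global histogram with atomics. No block ever does a
// plain (non-atomic) store to d_histo, so blocks cannot overwrite each other.
__global__ void histogram_kernel_private(const unsigned int* d_vals,
                                         unsigned int* d_histo,
                                         unsigned int numElems) {
    __shared__ unsigned int local_bin[NUM_BINS];

    // Start from zero instead of copying d_histo into shared memory.
    for (unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_bin[i] = 0;
    __syncthreads();

    // Assumption: d_vals[gid] < NUM_BINS for every element.
    unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    if (gid < numElems)
        atomicAdd(&local_bin[d_vals[gid]], 1u);
    __syncthreads();

    // Merge this block's partial counts into the global histogram.
    for (unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local_bin[i] != 0)
            atomicAdd(&d_histo[i], local_bin[i]);
}

// Hypothetical host helper: zeroes the histogram, launches the kernel and
// copies the result back. CUDA error checking is omitted for brevity.
std::vector<unsigned int> run_histogram(const std::vector<unsigned int>& vals) {
    const unsigned int n = static_cast<unsigned int>(vals.size());
    unsigned int* d_vals = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_vals, n * sizeof(unsigned int));
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_vals, vals.data(), n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 1024;
    const unsigned int blocks = (n + threads - 1) / threads;  // round up
    histogram_kernel_private<<<blocks, threads>>>(d_vals, d_histo, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> histo(NUM_BINS);
    cudaMemcpy(histo.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_histo);
    return histo;
}
```

With this layout the only writes to d_histo are atomic additions, so the final counts do not depend on the order in which blocks run. Whether it is actually faster than issuing atomicAdd directly on global memory depends on the input distribution and the GPU, so it is worth measuring both.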