CUDA: copying to shared memory and back gives random values

Time: 2013-12-16 16:50:23

Tags: cuda gpu gpgpu

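I am trying to use shared memory to compute a histogram and improve performance, but I have run into a problem I cannot figure out. Below is the kernel that is giving me trouble. I am sure I am missing something silly, but I cannot find it.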

__global__
 void histogram_kernel_shared(const unsigned int* const d_vals,
                    unsigned int* d_histo,
                    const unsigned int numElems) {

    unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int lid = threadIdx.x;

    unsigned int bin = d_vals[gid];
    __syncthreads();

    __shared__ unsigned int local_bin[1024];

    local_bin[lid] = d_histo[lid];
    __syncthreads();

    if(local_bin[lid] != d_histo[lid])
        printf("After copy to local. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);

    __syncthreads();

    // If I comment out this line everything works fine.
    d_histo[lid] = local_bin[lid];  

    // Even this leads to some wrong answers. Printouts on the next printf.
    // d_histo[lid] = d_histo[lid];  
     __syncthreads();

    if(local_bin[lid] != d_histo[lid])
        printf("copy back. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);

    __syncthreads();

    atomicAdd(&d_histo[bin], static_cast<unsigned int>(1));

    __syncthreads();

    // atomicAdd(&local_bin[bin], static_cast<unsigned int>(1));
    __syncthreads();

}

The kernel is launched as follows:

threads = 1024;
blocks = numElems/threads;
histogram_kernel_shared<<<blocks, threads>>>(d_vals, d_histo, numElems);

The number of elements is 10,240,000 and the number of bins is 1024.

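What puzzles me is why the assignment d_histo[lid] = local_bin[lid]; makes any difference here. Without it, the code runs correctly. Since I have just copied the values with local_bin[lid] = d_histo[lid];, writing them back should not change anything. And why would local_bin[lid] = d_histo[lid]; give garbage values as well?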

My guess is that there is undefined behavior somewhere else that produces these odd results, but where?

Thanks for your help.

1 answer:

Answer 0: (score: 3)

You are launching 10,000 blocks:

blocks = numElems/threads;

Every block writes to the first 1024 locations of d_histo (indexed by lid):

d_histo[lid] = local_bin[lid]; 

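Since you have 10,000 blocks all writing to the same locations, they step on one another and overwrite each other's values. In particular, a block can write back a stale copy of a bin after another block's atomicAdd has already incremented it, so that increment is lost. Because the order in which blocks execute is undefined, you are certain to get undefined behavior.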
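
One common way to avoid this race is to give each block a private histogram in shared memory that starts at zero, and to merge it into d_histo with atomicAdd instead of a plain store. The following is only a minimal sketch of that idea, not necessarily what the asker intended. It assumes d_histo is zeroed before the launch and that every value in d_vals is less than 1024. The NUM_BINS constant and the host-side main are illustrative additions that do not appear in the original post, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumption: bin count taken from the question (1024 bins).
constexpr unsigned int NUM_BINS = 1024;

__global__ void histogram_kernel_shared(const unsigned int* const d_vals,
                                        unsigned int* d_histo,
                                        const unsigned int numElems) {
    __shared__ unsigned int local_bin[NUM_BINS];

    const unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    const unsigned int lid = threadIdx.x;

    // Zero the block-private histogram (strided, so any block size works).
    for (unsigned int i = lid; i < NUM_BINS; i += blockDim.x)
        local_bin[i] = 0;
    __syncthreads();

    // Count into shared memory; skip threads past the end of the input.
    if (gid < numElems)
        atomicAdd(&local_bin[d_vals[gid]], 1u);
    __syncthreads();

    // Merge by adding, never by overwriting, so blocks cannot clobber
    // each other's contributions to the global histogram.
    for (unsigned int i = lid; i < NUM_BINS; i += blockDim.x)
        if (local_bin[i] != 0)
            atomicAdd(&d_histo[i], local_bin[i]);
}

int main() {
    const unsigned int numElems = 10240000;  // element count from the question
    std::vector<unsigned int> h_vals(numElems);
    std::vector<unsigned int> h_ref(NUM_BINS, 0), h_histo(NUM_BINS, 0);

    // Hypothetical test data: random bins, plus a CPU reference histogram.
    for (unsigned int i = 0; i < numElems; ++i) {
        h_vals[i] = static_cast<unsigned int>(rand()) % NUM_BINS;
        ++h_ref[h_vals[i]];
    }

    unsigned int *d_vals = nullptr, *d_histo = nullptr;
    cudaMalloc(&d_vals, numElems * sizeof(unsigned int));
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_vals, h_vals.data(), numElems * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));  // must start at zero

    const unsigned int threads = 1024;
    const unsigned int blocks = (numElems + threads - 1) / threads;  // round up
    histogram_kernel_shared<<<blocks, threads>>>(d_vals, d_histo, numElems);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_histo.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    const bool ok = (h_histo == h_ref);
    printf("%s\n", ok ? "histograms match" : "histograms differ");

    cudaFree(d_vals);
    cudaFree(d_histo);
    return ok ? 0 : 1;
}
```

With this layout, the only writes to d_histo are atomic additions, so the result does not depend on the order in which blocks run. Whether it is actually faster than atomics directly on global memory depends on the hardware and on how the input values are distributed, so it is worth measuring.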