Question

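I am doing some experiments with atomics in CUDA. My main question is how two threads running in the same block behave when they atomically access the same address. I ran some tests with atomicAdd and it worked atomically, but when I tried atomicCAS with the code below, the result was not what I expected. Does anyone have an explanation?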

#include <cuda_runtime.h>
#include <iostream>
#include <cuComplex.h>
using namespace std;
__global__ void kernel(int * pointer)
{
    *pointer=0;
    *(pointer+threadIdx.x+1)=0;
    __syncthreads();
    *(pointer+threadIdx.x+1)=atomicCAS(pointer,0,100);
}
int main(int argc,char ** argv)
{
    int numThreads=40;
    dim3 threadsPerBlock;
    dim3 blocks;
    int o[numThreads+1];
    int * pointer;
    cudaMalloc(&pointer,sizeof(int)*(numThreads+1));
    cudaMemset(pointer,0,sizeof(int)*(numThreads+1));
    threadsPerBlock.x=numThreads;
    threadsPerBlock.y=1;
    threadsPerBlock.z=1;
    blocks.x=1;
    blocks.y=1;
    blocks.z=1;
    kernel <<<threadsPerBlock,blocks>>> (pointer);
    cudaMemcpy(o,pointer,sizeof(int)*(numThreads+1),cudaMemcpyDeviceToHost);



    for (int i=0;i<numThreads+1;i++)
            cout << o[i] << " ";

    cout << endl;

}

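In the code above, the atomicCAS calls running within the same block all access the same address for the compare-and-swap. My expectation was that only one atomicCAS would find the value 0 to compare against and all the others would find 100, but strangely my program's output is: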

 100 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

That is, every thread finds the value being compared still set to 0.

Answer 1

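You have reversed the order of the execution configuration arguments. It is <<<gridDim, blockDim>>>, not the other way around. As a result, you are launching 40 blocks of 1 thread each, instead of 1 block of 40 threads.

That is why you get the result you see: only one thread runs in each block, so threadIdx.x is always 0 and only pointer[1] is ever written. The last numThreads-1 values in the array therefore always stay zero.

If I swap the order, I get this output: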

100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 0 100 100 100 100 100 100 100

You can see that all threads but one write 100 and exactly one thread writes 0, as expected.
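The expected behaviour can also be checked directly by counting how many threads win the compare-and-swap. The following is only a minimal sketch, not the asker's program: the names casKernel and countCasWinners are hypothetical, and the flag is zeroed from the host with cudaMemset so that no thread resets it inside the kernel. With one block of 40 threads, it should report a single thread that saw 0.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Every thread tries to swap the flag from 0 to 100 and records the
// value it observed (atomicCAS returns the old value).
__global__ void casKernel(int *flag, int *seen)
{
    seen[threadIdx.x] = atomicCAS(flag, 0, 100);
}

// Hypothetical helper: launches ONE block of numThreads threads and returns
// how many threads observed 0, or -1 if a CUDA call failed.
int countCasWinners(int numThreads)
{
    int *flag = nullptr;
    int *seen = nullptr;
    if (cudaMalloc(&flag, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&seen, sizeof(int) * numThreads) != cudaSuccess) {
        cudaFree(flag);
        return -1;
    }
    cudaMemset(flag, 0, sizeof(int));                  // flag starts at 0
    cudaMemset(seen, 0xFF, sizeof(int) * numThreads);  // all bytes 0xFF, i.e. -1

    casKernel<<<1, numThreads>>>(flag, seen);          // <<<grid, block>>>

    // cudaMemcpy waits for the kernel and reports errors from the launch.
    std::vector<int> host(numThreads);
    cudaError_t err = cudaMemcpy(host.data(), seen, sizeof(int) * numThreads,
                                 cudaMemcpyDeviceToHost);
    cudaFree(flag);
    cudaFree(seen);
    if (err != cudaSuccess) return -1;

    int winners = 0;
    for (int v : host)
        if (v == 0) ++winners;                         // this thread saw 0
    return winners;
}

int main()
{
    std::printf("threads that saw 0: %d\n", countCasWinners(40));
    return 0;
}
```

Initialising the flag from the host avoids the extra race in the original kernel, where every thread also stores 0 to *pointer before the barrier.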

Answer 2

You have your threadsPerBlock and blocks variables backwards in your kernel call.

Instead of:

 kernel <<<threadsPerBlock,blocks>>> (pointer);

Do this:

 kernel <<<blocks, threadsPerBlock>>> (pointer);

Then you will get the correct output.
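To see what each argument order actually launches, a kernel can record its own gridDim and blockDim. This is a small sketch under assumptions: shapeKernel and launchShape are hypothetical names, not part of the question's code, and only the x dimensions are inspected.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Thread 0 of block 0 records the launch shape seen on the device.
__global__ void shapeKernel(int *shape)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        shape[0] = gridDim.x;   // number of blocks
        shape[1] = blockDim.x;  // threads per block
    }
}

// Hypothetical helper: launches shapeKernel<<<grid, block>>> and copies
// {gridDim.x, blockDim.x} into out. Returns false if a CUDA call failed.
bool launchShape(dim3 grid, dim3 block, int out[2])
{
    int *d = nullptr;
    if (cudaMalloc(&d, 2 * sizeof(int)) != cudaSuccess) return false;
    shapeKernel<<<grid, block>>>(d);
    cudaError_t err = cudaMemcpy(out, d, 2 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

int main()
{
    dim3 threadsPerBlock(40, 1, 1);
    dim3 blocks(1, 1, 1);
    int s[2] = {0, 0};

    // Arguments in the order used in the question (swapped).
    if (launchShape(threadsPerBlock, blocks, s))
        std::printf("swapped: %d block(s) of %d thread(s)\n", s[0], s[1]);

    // Arguments in the intended order: grid first, then block.
    if (launchShape(blocks, threadsPerBlock, s))
        std::printf("correct: %d block(s) of %d thread(s)\n", s[0], s[1]);
    return 0;
}
```

Because both arguments are dim3 values, the compiler accepts either order without a warning, which is why the mistake only shows up in the output.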

atomicCAS: behavior within a block
