Question

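I'm new to CUDA. I'm trying to implement a trie data structure on the GPU, but it isn't working. I noticed that my atomicAdd doesn't behave the way I expected, so I ran some experiments with it. I wrote this code: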

#include <cstdio>

//__device__ int *a; // I also tried using this __device__ variable and
                     // allocating it inside the kernel instead of using
                     // cudaMalloc. Same result.

__global__ void AtomicTestKernel (int*a)
{
    *a = 0;
    __syncthreads();
    for (int i = 0; i < 2; i++)
    {
        if (threadIdx.x % 2)
        {
            atomicAdd(a, 1);
            printf("threadsIndex = %d\t&\ta : %d\n",threadIdx.x,*a);
        }
        else
        {
            atomicAdd(a, 1);
            printf("threadsIndex = %d\t&\ta : %d\n", threadIdx.x, *a);
        }
    }
}

int main()
{
    int * d_a;
    cudaMalloc((void**)&d_a, sizeof(int));

    AtomicTestKernel << <1, 10 >> > (d_a);

    cudaDeviceSynchronize();

    return 0;
}

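Correct me if I'm wrong about this code:

1 - According to the CUDA programming guide (on atomic functions):

...In other words, no other thread can access this address until the operation is complete.

2 - int * d_a resides in global memory, and so does the kernel's argument int * a, because it was allocated with cudaMalloc (according to this 3-minute video: Udacity CUDA - Global Memory). So all threads see the same int * a rather than each thread having its own copy.

3 - In the code, every printf is preceded by an atomicAdd, so I expect the value of *a at each printf to differ from the previous one, and therefore to be unique.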

But in the output I see many identical values. This is the result I got:

threadsIndex = 0 & a : 5
threadsIndex = 2 & a : 5
threadsIndex = 4 & a : 5
threadsIndex = 6 & a : 5
threadsIndex = 8 & a : 5
threadsIndex = 1 & a : 10
threadsIndex = 3 & a : 10
threadsIndex = 5 & a : 10
threadsIndex = 7 & a : 10
threadsIndex = 9 & a : 10
threadsIndex = 0 & a : 15
threadsIndex = 2 & a : 15
threadsIndex = 4 & a : 15
threadsIndex = 6 & a : 15
threadsIndex = 8 & a : 15
threadsIndex = 1 & a : 20
threadsIndex = 3 & a : 20
threadsIndex = 5 & a : 20
threadsIndex = 7 & a : 20
threadsIndex = 9 & a : 20
Press any key to continue . . .

Answer 1

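Because the threads of a warp execute each instruction together, the code executes all of the atomic instructions first and only then the printf, so every thread reads the result of all the atomic operations. In your output, the five even-numbered threads all finish their atomicAdd before any of them reads *a, so they all print 5; the five odd-numbered threads then do the same and all print 10.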

This is how the instructions execute within a warp:

Instruction | threadId 1       | threadId 2       | *a
------------+------------------+------------------+----
atomicAdd   | increasing value | waiting          | 1
            | waiting          | increasing value | 2
------ warp has finished the atomicAdd instruction for all threads ------
reading *a  | read value       | read value       | 2

To get the value the variable held just before a thread's own atomic operation, use the return value of atomicAdd.
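
As a minimal sketch of that suggestion (the names AtomicReturnKernel, seen and isPermutation are made up for this illustration and do not come from the question), each thread below keeps the value returned by atomicAdd instead of re-reading *counter afterwards. The counter is zeroed from the host with cudaMemset here, which is an assumption of this sketch rather than what the original kernel does. With 10 threads, every thread should receive a distinct value from 0 to 9, although the order in which the lines are printed is not guaranteed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores the value returned by atomicAdd: the content of
// *counter immediately before that thread's own increment.
__global__ void AtomicReturnKernel(int *counter, int *seen)
{
    int previous = atomicAdd(counter, 1);
    seen[threadIdx.x] = previous;
    printf("threadIdx = %d\tprevious = %d\n", threadIdx.x, previous);
}

// Host-side check (hypothetical helper): true if seen[0..n-1] contains
// every value 0..n-1 exactly once.
bool isPermutation(const int *seen, int n)
{
    std::vector<bool> hit(n, false);
    for (int i = 0; i < n; ++i) {
        if (seen[i] < 0 || seen[i] >= n || hit[seen[i]]) return false;
        hit[seen[i]] = true;
    }
    return true;
}

int main()
{
    const int numThreads = 10;  // same launch size as in the question
    int *d_counter = nullptr;
    int *d_seen = nullptr;
    cudaMalloc((void**)&d_counter, sizeof(int));
    cudaMalloc((void**)&d_seen, numThreads * sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));  // zero the counter before launch

    AtomicReturnKernel<<<1, numThreads>>>(d_counter, d_seen);
    cudaDeviceSynchronize();

    // Copy the per-thread results back and verify that they are unique.
    int h_seen[numThreads];
    cudaMemcpy(h_seen, d_seen, numThreads * sizeof(int), cudaMemcpyDeviceToHost);
    printf("all returned values unique: %s\n",
           isPermutation(h_seen, numThreads) ? "yes" : "no");

    cudaFree(d_counter);
    cudaFree(d_seen);
    return 0;
}
```

The returned value is unique per thread because the read and the increment happen as one indivisible operation, whereas a separate *a read after the atomicAdd can observe increments made by other threads in the meantime.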

You can find more information here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
