Question

In the code below, how can I compute the sum_array values without using atomicAdd?

Kernel

__global__ void calculate_sum( int width,
                               int height,
                               int *pntrs,
                               int2 *sum_array )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if ( row >= height || col >= width ) return;

    int idx = pntrs[ row * width + col ];

    //atomicAdd( &sum_array[ idx ].x, col );

    //atomicAdd( &sum_array[ idx ].y, row );

    sum_array[ idx ].x += col;

    sum_array[ idx ].y += row;
}

Launching the kernel

    dim3 dimBlock( 16, 16 );
    dim3 dimGrid( ( width + ( dimBlock.x - 1 ) ) / dimBlock.x, 
                  ( height + ( dimBlock.y - 1 ) ) / dimBlock.y );

Answer 1

Reduction is the general name for this kind of problem. See the presentation for a fuller explanation, or search Google for other examples.

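The general approach is to have each thread block compute a parallel sum over its own segment of global memory and store the result back in global memory. The partial results are then copied to CPU memory, summed on the CPU, and the final result is copied back to GPU memory. You can avoid these memory transfers by running another parallel sum over the partial results on the GPU.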
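As a minimal sketch of that approach for a plain array of ints: each block reduces its segment in shared memory and writes one partial sum, and the host adds up the partials. The names `block_sum`, `gpu_sum`, and `kThreads` are made up for illustration, the block size is assumed to be a power of two, and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two here

// Each block sums its own segment of `in` and writes one partial result.
__global__ void block_sum(const int *in, int *block_sums, int n)
{
    __shared__ int partial[kThreads];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 so the reduction stays uniform.
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the active range each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) block_sums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: per-block sums on the GPU, final sum on the CPU.
int gpu_sum(const std::vector<int> &host_in)
{
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0;
    int blocks = (n + kThreads - 1) / kThreads;

    int *d_in = nullptr, *d_block_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_block_sums, blocks * sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kThreads>>>(d_in, d_block_sums, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<int> host_partials(blocks);
    cudaMemcpy(host_partials.data(), d_block_sums, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_block_sums);
    return std::accumulate(host_partials.begin(), host_partials.end(), 0);
}
```

To keep everything on the GPU, the same kernel could be launched again on `d_block_sums` until a single value remains, instead of copying the partials to the host.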

Another approach is to use a highly optimized library for CUDA, such as Thrust or CUDPP, which provide functions that perform these operations.
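For the keyed sums in the question, one possible Thrust formulation is to sort the pixel positions by their `pntrs` value and then call `thrust::reduce_by_key`. This is only a sketch: the helper `sum_by_key` and the functors `ToCol` and `ToRow` are hypothetical names, the x and y sums are kept in two separate int arrays rather than an `int2` array, and it assumes the image is stored row-major as in the question.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <vector>

// Map a linear pixel index back to its column / row (row-major layout assumed).
struct ToCol {
    int width;
    __host__ __device__ int operator()(int i) const { return i % width; }
};
struct ToRow {
    int width;
    __host__ __device__ int operator()(int i) const { return i / width; }
};

// Hypothetical helper: for every distinct value k in pntrs, out_x holds the
// sum of the columns and out_y the sum of the rows of the pixels mapped to k.
void sum_by_key(const std::vector<int> &pntrs, int width,
                std::vector<int> &out_keys,
                std::vector<int> &out_x,
                std::vector<int> &out_y)
{
    const int n = static_cast<int>(pntrs.size());

    thrust::device_vector<int> keys(pntrs.begin(), pntrs.end());
    thrust::device_vector<int> pos(n);
    thrust::sequence(pos.begin(), pos.end());  // 0, 1, ..., n-1

    // Group equal keys together; pos follows its key.
    thrust::sort_by_key(keys.begin(), keys.end(), pos.begin());

    thrust::device_vector<int> cols(n), rows(n);
    thrust::transform(pos.begin(), pos.end(), cols.begin(), ToCol{width});
    thrust::transform(pos.begin(), pos.end(), rows.begin(), ToRow{width});

    // One output entry per run of equal keys.
    thrust::device_vector<int> uniq(n), sum_x(n), sum_y(n);
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), cols.begin(),
                                      uniq.begin(), sum_x.begin());
    thrust::reduce_by_key(keys.begin(), keys.end(), rows.begin(),
                          uniq.begin(), sum_y.begin());
    const int m = static_cast<int>(ends.first - uniq.begin());

    out_keys.resize(m);
    out_x.resize(m);
    out_y.resize(m);
    thrust::copy(uniq.begin(), uniq.begin() + m, out_keys.begin());
    thrust::copy(sum_x.begin(), sum_x.begin() + m, out_x.begin());
    thrust::copy(sum_y.begin(), sum_y.begin() + m, out_y.begin());
}
```

Keys that never occur in `pntrs` do not appear in the output, so the results have to be scattered back into `sum_array` by key if a dense array is needed.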

Answer 2

My CUDA is very, very rusty, but this is roughly how you do it (courtesy of "CUDA by Example", which I strongly suggest you read):

https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0

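Partition the array you need to sum more sensibly: CUDA threads are lightweight, but not so lightweight that you can spawn one for just two additions and expect any performance benefit.
At this point, each thread is responsible for summing a slice of the data: create a shared int array as large as the number of threads, in which each thread stores the partial sum it computed.
Synchronize the threads and reduce the shared memory array: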

(Please treat this as pseudocode)

// Code to sum over a slice: essentially a loop over this thread's subset
// that accumulates into "localsum" (a local variable)
...

// Save the result in shared memory
partial[threadIdx.x] = localsum;

// Synchronize the threads:
__syncthreads();

// From now on partial holds the results of all threads, so you can reduce it.
// We'll do it the naive way, using a single thread (it can easily be parallelized)
if (threadIdx.x == 0) {
    for (int i = 1; i < nthreads; ++i) {
        partial[0] += partial[i];
    }
}

And there you go: partial[0] will hold your sum (or whatever you computed).

For a more rigorous discussion of the topic, and for a reduction algorithm that runs in roughly O(log(n)), see the dot product example in "CUDA by Example".
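Applied to the question's kernel, the same idea might look like the sketch below. It is not the book's code. It assumes one thread block per `sum_array` entry and that every `pntrs` value lies in `[0, num_bins)`; the names `sum_per_bin`, `sum_bins`, `num_bins`, and `kThreads` are invented for the example. Each block rescans the whole image, so this trades extra reads for the absence of atomics, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two here

// Block b sums the column/row of every pixel whose pntrs value equals b.
__global__ void sum_per_bin(int width, int height,
                            const int *pntrs, int2 *sum_array)
{
    __shared__ int2 partial[kThreads];

    const int bin = blockIdx.x;
    const int tid = threadIdx.x;
    const int n = width * height;

    // Each thread accumulates its own slice into a local variable.
    int2 localsum = make_int2(0, 0);
    for (int i = tid; i < n; i += blockDim.x) {
        if (pntrs[i] == bin) {
            localsum.x += i % width;  // col
            localsum.y += i / width;  // row
        }
    }
    partial[tid] = localsum;
    __syncthreads();

    // Parallel reduction over the shared array, about log2(blockDim.x) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid].x += partial[tid + stride].x;
            partial[tid].y += partial[tid + stride].y;
        }
        __syncthreads();
    }

    // Only one thread writes each entry, so no atomics are needed.
    if (tid == 0) sum_array[bin] = partial[0];
}

// Hypothetical host helper that runs the kernel and returns the sums.
std::vector<int2> sum_bins(const std::vector<int> &pntrs,
                           int width, int height, int num_bins)
{
    int *d_pntrs = nullptr;
    int2 *d_sums = nullptr;
    cudaMalloc(&d_pntrs, pntrs.size() * sizeof(int));
    cudaMalloc(&d_sums, num_bins * sizeof(int2));
    cudaMemcpy(d_pntrs, pntrs.data(), pntrs.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    sum_per_bin<<<num_bins, kThreads>>>(width, height, d_pntrs, d_sums);

    std::vector<int2> sums(num_bins);
    cudaMemcpy(sums.data(), d_sums, num_bins * sizeof(int2),
               cudaMemcpyDeviceToHost);

    cudaFree(d_pntrs);
    cudaFree(d_sums);
    return sums;
}
```

With many bins or a large image, sorting by key first or using a library routine is likely to be faster than rescanning the image once per bin.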

Hope this helps.

How to compute a sum in CUDA without using atomics
