Question

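In my kernel, each thread processes a small portion of an array in global memory. After the processing, I also want to set a flag indicating that the result computed by every thread in the block is zero: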

__global__ void kernel( int *a, bool *blockIsNull) { 
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int result = 0;
  // {...} Here calculate result
  a[tid] = result;

  // some code here, but I don't know, that's my question...
  if (condition)
    blockIsNull[blockIdx.x] = true; // if all threads have returned result==0
}

Each individual thread has that information, but I can't find an efficient way to gather it.

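For example, I could have a counter in shared memory that each thread atomically increments when result == 0. When the counter reaches blockDim.x, all threads have returned zero. I haven't tested it, but I'm afraid this solution would hurt performance (atomic functions are slow).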
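The counter idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names counterKernel, counterFlag and the input array `in` are hypothetical, the real calculation is replaced by a simple load, the grid is assumed to cover the array exactly, and CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: a shared-memory counter tallies the threads whose result is zero.
__global__ void counterKernel(const int *in, int *a, bool *blockIsNull)
{
    __shared__ int zeroCount;
    if (threadIdx.x == 0) zeroCount = 0;
    __syncthreads();                      // counter is initialised before any increment

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int result = in[tid];                 // stand-in for the real calculation (assumption)
    a[tid] = result;

    if (result == 0) atomicAdd(&zeroCount, 1);
    __syncthreads();                      // all increments have completed

    // One thread per block publishes the flag.
    if (threadIdx.x == 0)
        blockIsNull[blockIdx.x] = (zeroCount == (int)blockDim.x);
}

// Hypothetical host helper: runs the kernel on blocks * threads ints
// and returns the flag computed for block `which`.
bool counterFlag(const int *hostIn, int blocks, int threads, int which)
{
    assert(which >= 0 && which < blocks);
    int n = blocks * threads;
    int *dIn = nullptr, *dA = nullptr;
    bool *dFlags = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dFlags, blocks * sizeof(bool));
    cudaMemcpy(dIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);

    counterKernel<<<blocks, threads>>>(dIn, dA, dFlags);

    bool flag = false;                    // this copy also waits for the kernel to finish
    cudaMemcpy(&flag, dFlags + which, sizeof(bool), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dA); cudaFree(dFlags);
    return flag;
}
```

Since zero results are expected to be rare, most threads would skip the atomicAdd entirely, so the extra cost in the common case is mostly the two __syncthreads() calls.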

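Zero results don't occur very often, so it is rather unlikely that all threads within a block have zero. I would like to find a solution that has little impact on performance in the general case.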

What would you suggest?

Answer 1

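It sounds like you want to perform a block-wide reduction of a conditional value. Almost all CUDA hardware supports a very useful set of warp vote primitives. You can use the __all() warp vote to determine whether every thread in a warp satisfies the condition, and then use __all() again to check whether all warps satisfy it. In code, it might look like this: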

__global__ void kernel( int *a, bool *blockIsNull) { 

    // assume that threads per block is <= 1024
    __shared__ volatile int blockcondition[32];
    int laneid = threadIdx.x % 32;
    int warpid = threadIdx.x / 32;

    // Set each condition value to non zero to begin
    if (warpid == 0) {
        blockcondition[threadIdx.x] = 1;
    }
    __syncthreads();

    //
    // your code goes here
    //

    // warpcondition holds the vote from each warp
    int warpcondition = __all(condition);

    // First thread in each warp loads the warp vote to shared memory
    if (laneid == 0) {
        blockcondition[warpid] = warpcondition;
    }
    __syncthreads();

    // First warp reduces all the votes in shared memory
    if (warpid == 0) {
        int result = __all(blockcondition[threadIdx.x] != 0);

        // first thread stores the block result to global memory
        if (laneid == 0) {
             blockIsNull[blockIdx.x] = (result !=0);     
        }
    }
}

[Huge disclaimer: written in browser, never compiled or tested, use at own risk]

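This code should (I think) work for any number of threads per block up to 1024. If you are confident about a smaller upper limit on the block size, you can shrink blockcondition to match. Probably the smartest approach is to use C++ templates and make the warp count limit a template parameter.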
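A rough sketch of the template idea, under some assumptions that go beyond the answer: since CUDA 9 the __all() intrinsic is deprecated in favor of __all_sync(), so the sketch uses the latter; the block size is assumed to be a multiple of 32; and the names voteKernel, syncAndKernel, voteFlag and the input array `in` are hypothetical. As a possible alternative, the second kernel uses __syncthreads_and(), which performs the block-wide AND in a single call. Error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical variant: the warp count limit is a template parameter.
// Assumes blockDim.x is a multiple of 32 and at most 32 * MAX_WARPS.
template <int MAX_WARPS>
__global__ void voteKernel(const int *in, int *a, bool *blockIsNull)
{
    static_assert(MAX_WARPS >= 1 && MAX_WARPS <= 32, "a block holds at most 32 warps");
    __shared__ int warpVote[MAX_WARPS];
    const unsigned fullMask = 0xffffffffu;
    int laneid = threadIdx.x % 32;
    int warpid = threadIdx.x / 32;
    int numWarps = blockDim.x / 32;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int result = in[tid];                 // stand-in for the real calculation (assumption)
    a[tid] = result;

    // Each warp votes; lane 0 stores the warp's vote in shared memory.
    int vote = __all_sync(fullMask, result == 0);
    if (laneid == 0) warpVote[warpid] = vote;
    __syncthreads();

    // The first warp reduces the per-warp votes.
    if (warpid == 0) {
        // Lanes without a matching warp vote "true" so they cannot veto.
        int v = (laneid < numWarps) ? warpVote[laneid] : 1;
        int blockVote = __all_sync(fullMask, v != 0);
        if (laneid == 0) blockIsNull[blockIdx.x] = (blockVote != 0);
    }
}

// Alternative sketch: __syncthreads_and() returns non-zero only if the
// predicate is non-zero for every thread of the block.
__global__ void syncAndKernel(const int *in, int *a, bool *blockIsNull)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int result = in[tid];                 // stand-in for the real calculation (assumption)
    a[tid] = result;
    int allZero = __syncthreads_and(result == 0);
    if (threadIdx.x == 0) blockIsNull[blockIdx.x] = (allZero != 0);
}

// Hypothetical host helper: runs one of the two kernels and returns the flag of block `which`.
bool voteFlag(const int *hostIn, int blocks, int threads, int which, bool useSyncAnd)
{
    assert(threads % 32 == 0 && threads <= 1024 && which >= 0 && which < blocks);
    int n = blocks * threads;
    int *dIn = nullptr, *dA = nullptr;
    bool *dFlags = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dFlags, blocks * sizeof(bool));
    cudaMemcpy(dIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);

    if (useSyncAnd) syncAndKernel<<<blocks, threads>>>(dIn, dA, dFlags);
    else            voteKernel<32><<<blocks, threads>>>(dIn, dA, dFlags);

    bool flag = false;                    // this copy also waits for the kernel to finish
    cudaMemcpy(&flag, dFlags + which, sizeof(bool), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dA); cudaFree(dFlags);
    return flag;
}
```

With a block size known at compile time, voteKernel could be instantiated with a smaller MAX_WARPS (for example voteKernel<4> for 128 threads) to shrink the shared array.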
