I want to block some blocks until a variable is set to a particular value, so I wrote this code to test whether a simple do-while loop would work.
__device__ int tag = 0;

__global__ void kernel() {
    if (threadIdx.x == 0) {
        volatile int v;
        do {
            v = tag;
        } while (v == 0);
    }
    __syncthreads();
    return;
}
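For what it's worth, a common reason a loop like this fails to spin is that `tag` itself is not read through a volatile access path: marking the local `v` as `volatile` does not stop the compiler from loading `tag` once and reusing the cached value. A hedged sketch of that fix (the `setter` kernel here is a hypothetical producer, not part of the original question):

```
__device__ int tag = 0;

// Reading the flag through a volatile pointer forces a fresh load from
// global memory on every iteration, so the loop can observe updates
// made by another block.
__global__ void waiter() {
    if (threadIdx.x == 0) {
        volatile int *p = &tag;      // volatile access path, not a volatile copy
        while (*p == 0) { /* spin */ }
    }
    __syncthreads();
}

__global__ void setter() {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        tag = 1;
        __threadfence();             // make the write visible to other blocks
    }
}
```

Even with this change, the pattern can still deadlock: if the waiting blocks occupy all the SMs, the block that would set the flag may never be scheduled.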
However, it doesn't work (no infinite loop occurs, which is very strange).

I'd like to ask whether there is any other way to block certain blocks until some condition is met, or some change to this code that would make it work.
Answer 0 (score: 3)
There is currently no reliable way to perform inter-block synchronisation in CUDA.

There are some hacky ways to achieve a form of locking or blocking between blocks with a modest total thread count, but they exploit undefined behaviour in the execution model which is not guaranteed to work the same way on all hardware or to keep working in the future. The only reliable way to ensure synchronisation or blocking between blocks is to use separate kernel launches. If you cannot get your algorithm to work correctly without inter-block synchronisation, you need a new algorithm, or your application may be a poor fit for the GPU architecture.
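The separate-launch approach can be sketched as follows (the kernel names and launch configuration are hypothetical; the point is that the boundary between the two launches acts as a grid-wide barrier, because the second kernel does not start until every block of the first has finished):

```
__global__ void phase_one(float *partial);                   // writes per-block results
__global__ void phase_two(const float *partial, float *out); // consumes all of them

void run(float *d_partial, float *d_out, int num_blocks) {
    phase_one<<<num_blocks, 256>>>(d_partial);
    // No explicit sync is needed between the launches: kernels issued to
    // the same stream execute in order, so phase_two sees every block of
    // phase_one completed.
    phase_two<<<1, 256>>>(d_partial, d_out);
    cudaDeviceSynchronize();  // wait on the host for both kernels to finish
}
```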
Answer 1 (score: 0)
Here is a hackish way I tried, to see whether it would work.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

__global__ static
void kernel(int *count, float *data)
{
    count += threadIdx.x;             // one counter per thread index
    data += gridDim.x * threadIdx.x;  // one row of results per thread index

    int i = blockIdx.x;
    if (i < gridDim.x - 1) {
        data[i] = i + 1;
        atomicAdd(count, 1);          // signal that this block has finished
        return;
    }
    // Last block: spin until the other gridDim.x - 1 blocks have incremented
    // the counter up to i.
    while (atomicMin(count, i) != i);
    float tmp = i + 1;
    for (int j = 0; j < i; j++) tmp += data[j];
    data[i] = tmp;
}
int main(int argc, char **args)
{
    int num = 100;
    if (argc >= 2) num = atoi(args[1]);

    int bytes = num * sizeof(float) * 32;
    float *d_data; cudaMalloc((void **)&d_data, bytes);
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < 32 * num; i++) h_data[i] = -1;  // being safe

    int h_count[32] = {1};
    int *d_count; cudaMalloc((void **)&d_count, 32 * sizeof(int));
    cudaMemcpy(d_count, &h_count, 32 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    kernel<<<num, 32>>>(d_count, d_data);

    cudaMemcpy(&h_count, d_count, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 32; i++) {
        printf("sum of first %d from thread %d is %d \n", num, i, (int)h_data[num - 1]);
        h_data += num;
    }
    cudaFree(d_count);
    cudaFree(d_data);
    free(h_data - num * 32);
}
I cannot guarantee that this will always work. But the breaking point on my card (a 320M) seems to be num = 5796. Maybe the hardware limit varies from card to card?
EDIT:

The answer is that n * (n + 1) / 2 > 2^24 for n > 5795 (which is the single-precision limit), and the accuracy of integer values beyond that point is undefined. Thanks to talonmies for pointing this out.
./a.out 5795
sum of first 5795 from thread 0 is 16793910
sum of first 5795 from thread 1 is 16793910
sum of first 5795 from thread 2 is 16793910
sum of first 5795 from thread 3 is 16793910
sum of first 5795 from thread 4 is 16793910
sum of first 5795 from thread 5 is 16793910
sum of first 5795 from thread 6 is 16793910
sum of first 5795 from thread 7 is 16793910
sum of first 5795 from thread 8 is 16793910
sum of first 5795 from thread 9 is 16793910
sum of first 5795 from thread 10 is 16793910
sum of first 5795 from thread 11 is 16793910
sum of first 5795 from thread 12 is 16793910
sum of first 5795 from thread 13 is 16793910
sum of first 5795 from thread 14 is 16793910
sum of first 5795 from thread 15 is 16793910
sum of first 5795 from thread 16 is 16793910
sum of first 5795 from thread 17 is 16793910
sum of first 5795 from thread 18 is 16793910
sum of first 5795 from thread 19 is 16793910
sum of first 5795 from thread 20 is 16793910
sum of first 5795 from thread 21 is 16793910
sum of first 5795 from thread 22 is 16793910
sum of first 5795 from thread 23 is 16793910
sum of first 5795 from thread 24 is 16793910
sum of first 5795 from thread 25 is 16793910
sum of first 5795 from thread 26 is 16793910
sum of first 5795 from thread 27 is 16793910
sum of first 5795 from thread 28 is 16793910
sum of first 5795 from thread 29 is 16793910
sum of first 5795 from thread 30 is 16793910
sum of first 5795 from thread 31 is 16793910
I edited my earlier code, which used a single block. This is more representative of real-world threads/blocks (the memory accesses are odd and will be slow, but they were done that way to quickly port my old test code to multiple threads).

It looks like there are some cases where you can synchronise across blocks, but it mostly depends on knowing certain things in advance (for this particular case, I only synchronise n - 1 blocks before having the last block perform a pointlessly silly sum).

This is just a proof of concept; don't take the code seriously.