Question

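I ran into a strange, hard-to-reproduce problem in CUDA that turned out to involve undefined behaviour. I want thread 0 to set a value in shared memory that all threads should then use.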

__shared__ bool p;
p = false;
if (threadIdx.x == 0) p = true;
__syncthreads();
assert(p);

Now assert(p); seemed to fail at random while I moved code around and commented parts out to track down the problem.

I was effectively using this construct in the following undefined-behaviour context:

#include <assert.h>

__global__ void test() {
    if (threadIdx.x == 0) __syncthreads(); // call __syncthreads in thread 0 only: this is a very bad idea
    // everything below may exhibit undefined behaviour


    // If the above __syncthreads runs only in thread 0, this will fail for all threads not in the first warp
    __shared__ bool p;
    p = false;
    if (threadIdx.x == 0) p = true;
    __syncthreads();
    assert(p);
}

int main() {
    test<<<1, 32 + 1>>>(); // nothing happens if you have only one warp, so we use one more thread
    cudaDeviceSynchronize();
    return 0;
}

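In the real code, the early __syncthreads() that is reached by only one thread is of course buried inside some function, so it is hard to find. On my setup (sm_50, GTX 980) this kernel runs to completion (no deadlock, contrary to what is advertised...) and the assertion fails for all threads outside the first warp.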

TL;DR

Is there any standard way to detect that __syncthreads() is not being called by all threads in a block? Maybe I am missing some debugger setting?

I could build my own (very slow) checked__syncthreads() that detects the situation using atomics and global memory, but I would rather have a standard solution.
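For illustration, here is a rough sketch of what such a hand-rolled check might look like. Everything in it is an assumption rather than a CUDA API: the helper name checked_syncthreads, the per-block arrivals counter in global memory, the per-thread epoch counter and the cycle-based timeout are all invented for this example. It replaces the hardware barrier with a spin on an atomic counter, so a missing thread shows up as a timeout instead of undefined behaviour. A thread that is merely slow could trigger a false positive, and the memory-ordering guarantees are weaker than those of a real __syncthreads().

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): software barrier with a timeout.
//   arrivals : zero-initialised counter in global memory, one per block
//   epoch    : per-thread count of barriers passed so far, starting at 0
// Returns false if the other threads of the block did not arrive in time.
__device__ bool checked_syncthreads(unsigned int* arrivals, unsigned int& epoch,
                                    long long timeout_cycles = 100000000LL)
{
    const unsigned int nthreads = blockDim.x * blockDim.y * blockDim.z;
    const unsigned int target = ++epoch * nthreads; // arrivals expected so far

    __threadfence_block();       // publish earlier writes before arriving
    atomicAdd(arrivals, 1u);     // announce this thread's arrival

    const long long start = clock64();
    while (atomicAdd(arrivals, 0u) < target) {      // atomic read of the counter
        if (clock64() - start > timeout_cycles) return false; // someone is missing
    }
    __threadfence_block();
    return true;
}

// Reproduces the bug from the question: only thread 0 reaches the barrier.
__global__ void divergent_kernel(unsigned int* arrivals, int* failures)
{
    unsigned int epoch = 0;
    if (threadIdx.x == 0) {
        if (!checked_syncthreads(&arrivals[blockIdx.x], epoch))
            atomicAdd(failures, 1);
    }
}

// Host-side driver: launches one block and returns the number of timeouts.
int count_barrier_failures(int threads_per_block)
{
    unsigned int* arrivals = nullptr;
    int* failures = nullptr;
    cudaMalloc(&arrivals, sizeof(unsigned int));
    cudaMalloc(&failures, sizeof(int));
    cudaMemset(arrivals, 0, sizeof(unsigned int));
    cudaMemset(failures, 0, sizeof(int));

    divergent_kernel<<<1, threads_per_block>>>(arrivals, failures);
    cudaDeviceSynchronize();

    int host_failures = 0;
    cudaMemcpy(&host_failures, failures, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(arrivals);
    cudaFree(failures);
    return host_failures;
}
```

Calling count_barrier_failures(33) from main launches the 33-thread configuration from the question. Because the counter never reaches the expected value, thread 0 gives up after the timeout and records a failure rather than silently continuing.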

Answer 1

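Your original code contains a data race between threads. Thread 0 can run ahead and execute "p = true", but a different thread may not have made any progress yet and can then still execute the p = false line, overwriting the result.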

The simplest fix for this particular example is to have only thread 0 write to p, something like

__shared__ bool p;
if (threadIdx.x == 0) p = true; 
__syncthreads();
assert(p);
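As a self-contained illustration of the single-writer pattern above, the following sketch wraps it in a kernel that counts threads observing a stale value instead of asserting. The names broadcast_flag and count_mismatches, and the mismatch counter itself, are made up for this example. Note that __syncthreads() sits outside any conditional, so every thread in the block reaches it.

```cuda
#include <cuda_runtime.h>

// Thread 0 is the only writer of p; all threads read it after the barrier.
__global__ void broadcast_flag(int* mismatches)
{
    __shared__ bool p;
    if (threadIdx.x == 0) p = true;   // single writer, so no write-write race
    __syncthreads();                  // unconditional: reached by every thread
    if (!p) atomicAdd(mismatches, 1); // count threads that saw a stale value
}

// Host-side driver: launches one block and returns the number of mismatches.
int count_mismatches(int threads_per_block)
{
    int* mismatches = nullptr;
    cudaMalloc(&mismatches, sizeof(int));
    cudaMemset(mismatches, 0, sizeof(int));

    broadcast_flag<<<1, threads_per_block>>>(mismatches);
    cudaDeviceSynchronize();

    int host_mismatches = 0;
    cudaMemcpy(&host_mismatches, mismatches, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(mismatches);
    return host_mismatches;
}
```

If the initialisation to false has to stay, one alternative is to put a second __syncthreads() between the p = false line and thread 0's write, so that no late p = false can overwrite the result.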

In CUDA, how can I detect that __syncthreads() is not being called by all threads in a block?
