Question

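CUDA printf does not seem to respect __syncthreads(), even within a single block. In particular, I would expect that if my threads print something after a call to __syncthreads(), I would see all of the earlier prints followed by all of the later prints. That is not what I am seeing, and I wonder whether I am missing something. Here is my code sample: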

#include <stdio.h>
#include <cuda_runtime_api.h>

#define ROUND_UP(x) (((x)&1) + ((x)>>1))
__global__ void test()
{
  int t = threadIdx.x, last = blockDim.x;
  int offset = ROUND_UP(last);

  while (last > 1 && t + offset < last) {
    offset = ROUND_UP(offset);
    last = ROUND_UP(last);
    __syncthreads();
    if (t == 33 || t == 64)
      printf("A: t = %d, last = %d\n", t, last);
  }
  while (last > 1) {
    last = ROUND_UP(last);
    __syncthreads();
    if (t == 33 || t == 64)
      printf("B: t = %d, last = %d\n", t, last);
  }
}

int main()
{
  test<<<1,66>>>();
  cudaDeviceSynchronize();
  return 0;
}

This produces the following output:

B: t = 64, last = 33
B: t = 64, last = 17
B: t = 33, last = 33
B: t = 64, last = 9
B: t = 33, last = 17
B: t = 64, last = 5
B: t = 33, last = 9
B: t = 64, last = 3
B: t = 33, last = 5
B: t = 64, last = 2
B: t = 33, last = 3
B: t = 64, last = 1
B: t = 33, last = 2
B: t = 33, last = 1

As I read this, thread 64 has exited __syncthreads() twice before thread 33 has entered it for the second time. How is that possible?

Answer 1

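According to the documentation,

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block

The OP's code appears to violate this requirement. According to the OP, refactoring the code to fix this made the puzzling printf observations disappear.

If a problem exists in this area, the cuda-memcheck tool provides a synccheck option that can be used to find invalid uses of __syncthreads() in divergent code.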
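
For illustration, here is one way the kernel might be restructured so that every thread reaches the same __syncthreads() call on every iteration. This is only a sketch, not the OP's actual fix (which is not shown in the question). In the original kernel, threads leave the first loop at different iterations, so some threads wait at the barrier in the first loop while others wait at the barrier in the second. The version below folds both loops into one whose condition depends only on blockDim.x. The kernel name test_uniform and the flag active are hypothetical names introduced here.

```cuda
#include <stdio.h>
#include <cuda_runtime_api.h>

#define ROUND_UP(x) (((x)&1) + ((x)>>1))

// Sketch: a single loop with a block-uniform trip count, so the barrier
// is never inside thread-dependent control flow.
__global__ void test_uniform()
{
  int t = threadIdx.x, last = blockDim.x;
  int offset = ROUND_UP(last);
  bool active = true;  // hypothetical flag: true while the thread is in "phase A"

  // `last` starts at blockDim.x and is updated identically by every thread,
  // so this condition evaluates the same way across the whole block.
  while (last > 1) {
    // Once a thread fails the phase-A test it stays in phase B,
    // mirroring the fall-through from the first loop to the second.
    active = active && (t + offset < last);
    if (active)
      offset = ROUND_UP(offset);
    last = ROUND_UP(last);

    __syncthreads();  // reached by all threads, once per iteration

    if (t == 33 || t == 64)
      printf("%c: t = %d, last = %d\n", active ? 'A' : 'B', t, last);
  }
}

int main()
{
  test_uniform<<<1,66>>>();
  cudaDeviceSynchronize();
  return 0;
}
```

With the barrier reached uniformly, a thread's print for one iteration should no longer be able to run more than one barrier ahead of another thread's prints. To check the original kernel for invalid barrier usage, the synccheck option mentioned above can be invoked as `cuda-memcheck --tool synccheck ./a.out`, where `./a.out` stands for whatever the compiled binary is called.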

CUDA printf and __syncthreads ordering