CUDA内存检查器的粒度是多少?

时间:2013-02-15 20:47:08

标签: debugging cuda

我有一个在调试版本中工作但在发布版本中失败的内核。我怀疑我正在阅读或写出界限,但CUDA Memory Checker没有显示任何错误。所以,我做了一个测试,我用以下内核编写和读出界限:

__global__ void addKernel(int *c, const int *a, const int *b)
{
  int x[1];
  for (int i(0); i < 100; ++i) {
    x[i] = i;
  }
  int t(0);
  for (int i(0); i < 100; ++i) {
    t += x[i];
  }
  c[0] = t;
}

CUDA内存检查器没有捕获越界写入和读取。通过将循环增加到1,000,000轮,我可以得到一个超出边界的写入,这导致内核暂停在i = 502,586(超出界限2MB)。

这是CUDA内存检查器工作的预期粒度吗?我能做些什么来让它检测小的越界写入(大约几个字节的数量?)

1 个答案:

答案 0 :(得分:1)

我认为你刚刚在这个例子中通过优化来绊倒,正如已经建议的那样。

这是我的测试用例:

$ cat t1130.cu
#include <stdio.h>

__global__ void addKernel(int *c)
{
  int x[1];
  for (int i(0); i < 100; ++i) {
    x[i] = i;
#ifdef FORCE
  printf("%d ", i);
#endif
  }
  int t(0);
  for (int i(0); i < 100; ++i) {
    t += x[i];
  }
  c[0] = t;
}

int main(){

  int *d_c;
  cudaMalloc(&d_c, sizeof(int));
  addKernel<<<1,1>>>(d_c);
  cudaDeviceSynchronize();
}
$ nvcc -o t1130 t1130.cu
$ cuda-memcheck ./t1130
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvcc -DFORCE -o t1130 t1130.cu
$ cuda-memcheck ./t1130
========= CUDA-MEMCHECK
========= Invalid __local__ write of size 4
=========     at 0x00000168 in addKernel(int*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x00fffd10 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15865d]
=========     Host Frame:./t1130 [0x16ca1]
=========     Host Frame:./t1130 [0x314b3]
=========     Host Frame:./t1130 [0x27a1]
=========     Host Frame:./t1130 [0x269c]
=========     Host Frame:./t1130 [0x26b6]
=========     Host Frame:./t1130 [0x2600]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
=========     Host Frame:./t1130 [0x2489]
=========
0 1 2 3 ========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib64/libcuda.so.1 [0x2f31b3]
=========     Host Frame:./t1130 [0x354a6]
=========     Host Frame:./t1130 [0x2605]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
=========     Host Frame:./t1130 [0x2489]
=========
========= ERROR SUMMARY: 2 errors
$

因此,如果我们按原样编译您的代码,它在cuda-memcheck下运行时没有错误。但是,如果我们强制编译器生成实际循环,则会检测并报告无效访问。

同样解释您的评论,如果我们在上述情况下使用-G而不是-DFORCEcuda-memcheck也会检测并报告错误,因为优化已被停用。