Question

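I'm having a hard time understanding a bug in a simple CUDA kernel. I have reduced the kernel to the minimum that still shows the error.

I have a "Polygon" class that just stores a number of points. I have a function that "adds a point" (it only increments the counter), and I add 4 points to every polygon in my array of polygons. Finally, I call a function that updates the number of points using a loop. If I call new_nbpts++ once inside this loop, I get the expected answer: all polygons have 4 points. If I call new_nbpts++ a second time in the same loop, my polygons end up with a garbage number of points (4194304), which is not correct (I should get 8).

I expect there is something I have misunderstood.

Complete kernel: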

#include <stdio.h>
#include <cuda.h>


class Polygon {
public:
  // __host__ is needed too: main() default-constructs Polygon p_h[N] on the host
  __host__ __device__ Polygon() : nbpts(0) {}
  __device__ void addPt() {
    nbpts++;
  }
  __device__ void update() {
    int new_nbpts = 0;
    for (int i=0; i<nbpts; i++) {
        new_nbpts++;
        new_nbpts++;  // calling that a second time screws up my result
    }
    nbpts = new_nbpts;
  }

 int nbpts;
};


__global__ void cut_poly(Polygon* polygons, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx>=N) return;

  Polygon pol;
  pol.addPt();
  pol.addPt();
  pol.addPt();
  pol.addPt();

  for (int i=0; i<N; i++) {
    pol.update();
  }

  polygons[idx] = pol;
}



int main(int argc, char* argv[])
{
  const int N = 20; 
  Polygon p_h[N], *p_d;

  cudaError_t err = cudaMalloc((void **) &p_d, N * sizeof(Polygon));   

  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  cut_poly <<< n_blocks, block_size >>> (p_d, N);

  cudaMemcpy(p_h, p_d, sizeof(Polygon)*N, cudaMemcpyDeviceToHost);

  for (int i=0; i<N; i++)
   printf("%d\n", p_h[i].nbpts);

  cudaFree(p_d);

  return 0;
}

Answer 1

Why are you doing this at the end of your kernel:

  for (int i=0; i<N; i++) {
    pol.update();
  }

Remember that each thread has its own instance of:

Polygon pol;

If you want to update each thread's instance of pol at the end of the kernel, you only need to do:

pol.update();

Now, what happens in your case?

Suppose your update() code has only one:

new_nbpts++;

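line in it.

Your for loop from 0 to N-1 calls pol.update(), and on each iteration that call will:

1. set new_nbpts to zero
2. increment new_nbpts a total of nbpts times
3. replace the value of nbpts with new_nbpts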

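Hopefully you can see that this leaves nbpts unchanged. Even after N iterations of the for loop that calls pol.update(), the value of nbpts has not changed.

Now what happens if I have: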

new_nbpts++;
new_nbpts++;

in my update() method? Then on each call of pol.update(), I will:

1. set new_nbpts to zero
2. increase new_nbpts by 2, a total of nbpts times
3. replace the value of nbpts with new_nbpts

Hopefully you can see that this has the effect of doubling nbpts on each call of pol.update().

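Now, since you are calling pol.update() N times in each thread, you are doubling the starting value of nbpts N times, i.e. nbpts * 2^N. Since nbpts starts out (in this case) as 4 and N is 20, we have 4 * 2^20 = 4194304.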
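If you want to check that arithmetic without a GPU, here is a minimal host-side sketch in plain C++. The helpers update_once and simulate_thread are hypothetical names of mine, not part of the question's code; they only model what a single thread does with its own pol, assuming increments_per_iteration stands for how many times new_nbpts++ appears in the loop body.

```cpp
#include <cassert>

// Hypothetical host-side model of Polygon::update() from the question.
// increments_per_iteration is 1 or 2, matching how many times
// new_nbpts++ appears inside the loop body.
int update_once(int nbpts, int increments_per_iteration) {
    int new_nbpts = 0;
    for (int i = 0; i < nbpts; i++) {
        new_nbpts += increments_per_iteration;
    }
    return new_nbpts;
}

// Hypothetical model of the per-thread loop
//   for (int i = 0; i < N; i++) pol.update();
// where `calls` plays the role of N.
int simulate_thread(int start_nbpts, int increments_per_iteration, int calls) {
    assert(calls >= 0);
    int nbpts = start_nbpts;
    for (int i = 0; i < calls; i++) {
        nbpts = update_once(nbpts, increments_per_iteration);
    }
    return nbpts;
}
```

With these, simulate_thread(4, 1, 20) returns 4 (nothing changes), while simulate_thread(4, 2, 20) returns 4194304, the "garbage" value from the question. A single call, simulate_thread(4, 2, 1), returns 8.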

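I'm not really sure what you're after with all this, but my guess is that you were running that for loop at the end of your kernel thinking it would update all the different instances of Polygon pol. That's not how it works; all you need is a single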

pol.update();

at the end of the kernel, if that was your intent.
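To make the "one instance per thread" point concrete, here is a rough host-only sketch in plain C++, assuming the intent really was one update per polygon. The function cut_poly_host is a hypothetical stand-in for the kernel, with each loop index playing the role of one thread; the Polygon struct is a host copy of the one in the question, with the doubled increment kept.

```cpp
#include <cassert>
#include <vector>

// Host-only copy of the question's Polygon; update() still increments twice.
struct Polygon {
    int nbpts = 0;
    void addPt() { nbpts++; }
    void update() {
        int new_nbpts = 0;
        for (int i = 0; i < nbpts; i++) {
            new_nbpts++;
            new_nbpts++;
        }
        nbpts = new_nbpts;
    }
};

// Hypothetical stand-in for the cut_poly kernel: each value of idx
// models one GPU thread working on its own private Polygon.
std::vector<Polygon> cut_poly_host(int n) {
    assert(n >= 0);
    std::vector<Polygon> polygons(n);
    for (int idx = 0; idx < n; idx++) {
        Polygon pol;                 // per-thread instance
        for (int k = 0; k < 4; k++) {
            pol.addPt();
        }
        pol.update();                // single call, no loop over N
        polygons[idx] = pol;
    }
    return polygons;
}
```

Every element returned by cut_poly_host(20) has nbpts equal to 8, which is the value the question expected. In the real kernel the equivalent change is to replace the for loop around pol.update() with that one call.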

Strange CUDA bug
