Question

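I'm writing a CUDA kernel that creates a 3x3 covariance matrix for each location in a rows * cols main matrix. The resulting 3D matrix is rows * cols * 9 in size, and I allocate it accordingly in a single malloc. I need to access it with a single index value.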

The 9 values of each 3x3 covariance matrix are set from the appropriate row r and column c of some other 2D arrays.

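In other words, I need to compute the index for each of the 9 elements of the 3x3 covariance matrix, the row and column offset into the 2D matrices that supply the input values, and the matching index into the storage array.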

I have tried to simplify it down to the following:

   //I am calling this kernel with 1D blocks who are 512 cols x 1row. TILE_WIDTH=512
   int bx = blockIdx.x;
   int by = blockIdx.y;
   int tx = threadIdx.x;
   int ty = threadIdx.y;
   int r = by + ty; 
   int c = bx*TILE_WIDTH + tx;
   int offset = r*cols+c; 
   int ndx = r*cols*rows + c*cols;


   if((r < rows) && (c < cols)){ //this IF statement is trying to avoid the case where a threadblock went bigger than my original array..not sure if correct

      d_cov[ndx + 0] = otherArray[offset];//otherArray just contains a value that I might do some operations on to set each of the ndx0-ndx9 values in d_cov
      d_cov[ndx + 1] = otherArray[offset];
      d_cov[ndx + 2] = otherArray[offset];
      d_cov[ndx + 3] = otherArray[offset];
      d_cov[ndx + 4] = otherArray[offset];
      d_cov[ndx + 5] = otherArray[offset];  
      d_cov[ndx + 6] = otherArray[offset];
      d_cov[ndx + 7] = otherArray[offset];   
      d_cov[ndx + 8] = otherArray[offset];  
   }

When I check this array against the values computed on the CPU, looping over i = rows, j = cols, k = 1..9, the results do not match.

In other words, d_cov[i * rows * cols + j * cols + k] != correctAnswer[i][j][k]

Can anybody give me any tips on how to solve this problem? Is it an indexing issue, or some other logic error?

Answer 1

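This is not an answer (I haven't looked hard enough to find one) but a technique I usually use for debugging this kind of problem. First, set every value in the destination array to NaN. (You can do this with cudaMemset by setting every byte to 0xFF.) Then have every location write the value of its row, and inspect the result. In theory, it should look something like: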

0 0 0 ... 0
1 1 1 ... 1
. . . .   .
. . .  .  .
. . .   . .
n n n ... n

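If you see NaNs, an element was never written; if you see row values out of place, something is wrong, and they are usually out of place in a suggestive pattern. Do the same with the column values and with the planes. This trick usually helps me find which part of the index calculation is off, which is most of the battle. Hope that helps.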
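The procedure above can be tried on the host before touching the kernel. The following is only a minimal sketch in plain C: the helper names (`fill_nan`, `stamp_rows`, `count_unwritten`, `count_misplaced`, `base_packed`, `base_question`) are hypothetical, the host loops stand in for the kernel's threads, and the "expected" layout is assumed to be nine consecutive values per cell, as the rows * cols * 9 allocation suggests.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for "index formula under test": base index of cell (r, c). */
typedef size_t (*base_index_fn)(int r, int c, int rows, int cols);

/* Assumed packed layout: nine consecutive values per (r, c) cell. */
size_t base_packed(int r, int c, int rows, int cols) {
    (void)rows;
    return ((size_t)r * cols + c) * 9;
}

/* The base index expression taken from the question's kernel. */
size_t base_question(int r, int c, int rows, int cols) {
    return (size_t)r * cols * rows + (size_t)c * cols;
}

/* Setting every byte to 0xFF makes each IEEE-754 float a NaN
   (on the device, cudaMemset with value 0xFF has the same effect). */
void fill_nan(float *buf, size_t count) {
    memset(buf, 0xFF, count * sizeof(float));
}

/* Emulate the kernel's writes: every slot of cell (r, c) gets the row number. */
void stamp_rows(float *cov, int rows, int cols, base_index_fn base) {
    assert(cov != NULL);
    size_t total = (size_t)rows * cols * 9;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            for (int k = 0; k < 9; ++k) {
                size_t idx = base(r, c, rows, cols) + (size_t)k;
                if (idx < total)            /* skip writes that would fall outside the buffer */
                    cov[idx] = (float)r;
            }
}

/* Slots that are still NaN were never written. */
size_t count_unwritten(const float *buf, size_t count) {
    size_t n = 0;
    for (size_t i = 0; i < count; ++i)
        if (isnan(buf[i])) ++n;
    return n;
}

/* Slots whose value is not the expected row number (NaN also counts). */
size_t count_misplaced(const float *cov, int rows, int cols) {
    size_t bad = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            for (int k = 0; k < 9; ++k) {
                float v = cov[((size_t)r * cols + c) * 9 + (size_t)k];
                if (!(v == (float)r)) ++bad;
            }
    return bad;
}
```

A typical use would be to call `fill_nan`, then `stamp_rows` with the formula being tested, and then look at both counters: leftover NaNs point at elements that were skipped, and misplaced row numbers point at a wrong stride.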

Answer 2

I might just be being stupid, but what is the logic in this line?

int ndx = r*cols*rows + c*cols;

Shouldn't it be

int ndx = offset*9;

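If the covariance array has size rows * cols * 9, then offset * 9 takes you to the position in the 3D covariance array that corresponds to your position in the input array. So offset * 9 + 0 is element (0,0) of the 3x3 covariance matrix for the element at offset, offset * 9 + 1 is (0,1), offset * 9 + 2 is (0,2), offset * 9 + 3 is (1,0), and so on up to offset * 9 + 8.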
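To make the mapping concrete, here is a small sketch in plain C of the index arithmetic described in this answer. The function names `cov_index` and `question_base` are made up for illustration, and the layout (nine consecutive values per cell, row-major inside each 3x3 matrix) is an assumption based on the rows * cols * 9 allocation; the same expressions would apply unchanged inside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of entry (i, j) of the 3x3 matrix stored for cell (r, c),
   assuming the buffer is packed cell by cell (hypothetical helper). */
size_t cov_index(int r, int c, int cols, int i, int j) {
    assert(i >= 0 && i < 3 && j >= 0 && j < 3);
    size_t offset = (size_t)r * cols + c;      /* same offset as in the 2D input array */
    return offset * 9 + (size_t)(i * 3 + j);   /* 9 slots per cell, row-major 3x3 */
}

/* Base index as written in the question, kept only for comparison. */
size_t question_base(int r, int c, int rows, int cols) {
    return (size_t)r * cols * rows + (size_t)c * cols;
}
```

With the question's expression, neighbouring columns are cols elements apart instead of 9, and for the last row the base index is (rows - 1) * cols * rows, which is past the end of a rows * cols * 9 buffer once rows reaches 10. The host-side comparison d_cov[i * rows * cols + j * cols + k] uses the same expression, so it would presumably need the same change.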

Trouble calculating offset index into 3D array
