Question

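This problem most likely has a simple solution.

Each thread I spawn is to be initialized to a starting value. For example, if I have a character set char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads, so threadIdx.0 corresponds to charSet[0] = a. Simple enough.

I thought I had found a way to do this, until I examined what my threads were actually doing...

Here is an example program I wrote: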

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

__global__ void example(int offset, int reqThreads){
//Declarations
   unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

   if(idx < reqThreads){
       unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here
       printf("%d ", tid);
   }    
}

int main(void){
   //Declarations
   int minLength = 1;
   int maxLength = 2;
   int offset;
   int totalThreads;
   int reqThreads;
   int base = 26;
   int maxThreads = 512;
   int blocks;
   int i,j; 

   for(i = minLength; i<=maxLength; i++){
      offset = i;

      //Calculate parameters
      reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here
      totalThreads = reqThreads;

      for(j = 1;(totalThreads % maxThreads) != 0; j++) totalThreads += 1;   //Create a multiple of 512

      blocks = totalThreads/maxThreads;

      //Call the kernel

      example<<<blocks, totalThreads>>>(offset, reqThreads);
      cudaThreadSynchronize();
      printf("\n\n");
  }

  system("pause");
  return 0;
}

My reasoning is that this statement

 unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;

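allows me to introduce an offset. If offset is 2, then threadIdx.0 * offset = 0, threadIdx.1 * offset = 2, threadIdx.2 * offset = 4, and so on. That is definitely not what happens. The output of the program above is correct when offset == 1: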

0 1 2 3 4 5...26.

But when offset == 2:

1344 1346 1348 1350...

In fact, these values are far outside the bounds of my array, so I'm not sure what is going wrong.

Any constructive input is appreciated.

Answer 1

I believe your kernel call should look like this:

  example<<<blocks, maxThreads>>>(offset, reqThreads);

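Your intent is to launch thread blocks of 512 threads, so that number (maxThreads) should be your second kernel configuration parameter, which is the number of threads per block.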
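As a rough illustration of that launch configuration, the block count can be computed with a ceiling division instead of the incrementing loop in the question. This is a plain C++ host-side sketch so it can be run without a GPU; `blocksFor` and `launchedThreads` are hypothetical helper names, not part of the original program.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): number of blocks needed
// so that blocks * threadsPerBlock >= reqThreads.
int blocksFor(int reqThreads, int threadsPerBlock) {
    assert(reqThreads >= 0 && threadsPerBlock > 0);
    return (reqThreads + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Total number of threads actually launched with that configuration.
int launchedThreads(int reqThreads, int threadsPerBlock) {
    return blocksFor(reqThreads, threadsPerBlock) * threadsPerBlock;
}
```

With maxThreads = 512, this gives 1 block for reqThreads = 26 and 2 blocks for reqThreads = 676, the same values the question's loop produces. The launch would then read `example<<<blocksFor(reqThreads, maxThreads), maxThreads>>>(offset, reqThreads);`, with the per-block thread count, not the total, as the second parameter.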

Also, this is deprecated:

  cudaThreadSynchronize();

Use this instead:

  cudaDeviceSynchronize();

If you are expecting a lot of output from printf in the kernel, you can lose some of the output if you exceed the buffer.

Finally, I'm not sure the reasoning behind the index range you expect is correct.

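When offset = 2 (the second pass through the loop), 26^2 = 676, so you end up with 1024 threads (in 2 thread blocks, if you make the fix above). The second thread block will have

tid = (2*threadIdx.x) + blockDim.x*blockIdx.x;
         (0..163)       (512)         (1)

So the second thread block should print indices from 512 (minimum) up to (2 * 163) + 512 = 838

(163 = 675 - 512, since only threads with idx < 676 pass the check)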

The first thread block should print out the following indices:

tid = (2*threadIdx.x) + blockDim.x * blockIdx.x
          (0..511)       (512)           (0)

i.e. 0 to 1022
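The per-block ranges can be checked on the host with a small model of the kernel's arithmetic. This is only a sketch in plain C++ (no GPU needed); `TidRange` and `tidRange` are hypothetical names introduced for illustration, and the loop simply replays the `idx` and `tid` expressions from the question for one block.

```cpp
#include <cassert>

// Smallest and largest tid printed by one block; count == 0 means no thread
// in that block passed the idx < reqThreads guard.
struct TidRange {
    int count;
    int minTid;
    int maxTid;
};

// Host-side model of the kernel's arithmetic (hypothetical helper):
//   idx = threadIdx.x + blockIdx.x * blockDim.x
//   tid = offset * threadIdx.x + blockIdx.x * blockDim.x   (only if idx < reqThreads)
TidRange tidRange(int offset, int reqThreads, int blockDim, int blockIdx) {
    assert(offset >= 0 && blockDim > 0 && blockIdx >= 0);
    TidRange r = {0, 0, 0};
    for (int t = 0; t < blockDim; ++t) {
        int idx = t + blockIdx * blockDim;
        if (idx >= reqThreads) break;          // remaining threads fail the guard too
        int tid = offset * t + blockIdx * blockDim;
        if (r.count == 0) r.minTid = tid;
        r.maxTid = tid;                        // tid is non-decreasing in t since offset >= 0
        ++r.count;
    }
    return r;
}
```

For offset = 2 and reqThreads = 676 with 512 threads per block, the model gives 0 to 1022 for block 0 and 512 to 838 for block 1 (164 active threads, threadIdx.x from 0 to 163). Under the original launch, where the second parameter was 1024, block 0 alone covers every idx below 676 and its tid values run up to 1350, which is consistent with the 1344 1346 1348 1350 seen in the question.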

Problem accessing an array based on an offset in CUDA
