Question

I wrote a simple CUDA kernel as follows:

    __global__ void cudaDoSomethingInSharedMemory(float* globalArray, size_t pitch){

      __shared__ float sharedInputArray[1088];
      __shared__ float sharedOutputArray[1088];

      int tid = threadIdx.x; //Use 1D block
      int rowIdx = blockIdx.x; //Use 1D grid

      int rowOffset = pitch/sizeof(float);//Offset in elements (not in bytes)

       //Copy data from global memory to shared memory (checked)
       while(tid < 1088){
           sharedInputArray[tid] = *(((float*) globalArray) + rowIdx*rowOffset + tid);
           tid += blockDim.x;
           __syncthreads();
       }
       __syncthreads();

       //Do something (already simplified and the problem still exists)
       tid = threadIdx.x;
       while(tid < 1088){
           if(tid%2==1){
              if(tid == 1087){
                 sharedOutputArray[tid/2 + 544] = 321;
              }
              else{
                  sharedOutputArray[tid/2 + 544] = 321;
              }
           }
           tid += blockDim.x;
           __syncthreads();
       }

       tid = threadIdx.x;
       while(tid < 1088){
           if(tid%2==0){
               if(tid==0){
                    sharedOutputArray[tid/2] = 123;
               }
               else{
                    sharedOutputArray[tid/2] = 123;
               }

           }
           tid += blockDim.x;
           __syncthreads();
       }
       __syncthreads();

       //Copy data from shared memory back to global memory (and add read-back for test)
       float temp = -456;
       tid = threadIdx.x;
       while(tid < 1088){
           *(((float*) globalArray) + rowIdx*rowOffset + tid) = sharedOutputArray[tid];
            temp = *(((float*) globalArray) + rowIdx*rowOffset + tid);//(1*) Errors are found.
            __syncthreads();
            tid += blockDim.x;
       }
       __syncthreads();
    }

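The code rearranges "sharedOutputArray" from "interleaved" to "clustered": "123 321 123 321 ... 123 321" becomes "123 123 123 ... 123 321 321 321 ... 321". The clustered result is then written to the global memory array "globalArray". "globalArray" is allocated by "cudaMallocPitch()".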

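This kernel processes a 2D array. The idea is simple: one block per row (so a 1D grid whose block count equals the number of rows) and N threads per row. There are 1920 rows and 1088 columns, so there are 1920 blocks.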

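The problem: when N (the number of threads in a block) is 64, 128, or 256, everything works fine (or at least appears to). However, when N is 512 (I am using a CUDA-capable GTX570, and the maximum size of each dimension of a block is 1024), errors occur.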

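The error: in a row in global memory, the elements (each a 4-byte float) at positions 256 to 287 are 0 instead of 123 (indices start at 0; the erroneous run is 32 elements, i.e. 128 bytes, long). The row looks like "123 123 123 ... 0 0 0 0 0 ... 0 123 123 ...". I checked at line (1*) above: those elements are 123 in "sharedOutputArray", yet when such an element (e.g. tid == 270) is read back at (1*), "temp" shows 0. I also looked at "tid == 255" and "tid == 288", and there the elements are 123 (correct). This kind of error occurs in almost all of the 1920 rows.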

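I tried "synchronizing" the threads (probably over-synchronizing already), but it did not help. What puzzles me is why 64, 128, or 256 threads work fine but 512 does not. I know that using 512 threads may not be optimal for performance; I just want to know where I made a mistake.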

Thanks in advance.

Answer 1

You are using __syncthreads() in conditional code where the condition does not evaluate uniformly across the threads of the block. Don't do that.
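To see why the condition is not uniform, it may help to count how many times each thread runs the body of a loop of the form `while (tid < 1088) { ...; tid += blockDim.x; }`. The sketch below is plain host code; the helper names `loopTrips` and `tripsUniform` are made up for illustration and are not part of the question's code.

```cuda
// Host-only sketch: how many times does each thread execute the loop body
// (and therefore the __syncthreads() inside it)?

// Trip count for one thread of a strided loop over `len` elements.
int loopTrips(int tid, int threadsPerBlock, int len)
{
    int trips = 0;
    for (int i = tid; i < len; i += threadsPerBlock)
        ++trips;
    return trips;
}

// True only if every thread of the block has the same trip count, i.e. only
// if a barrier inside the loop would be reached uniformly by the whole block.
bool tripsUniform(int threadsPerBlock, int len)
{
    const int first = loopTrips(0, threadsPerBlock, len);
    for (int t = 1; t < threadsPerBlock; ++t)
        if (loopTrips(t, threadsPerBlock, len) != first)
            return false;
    return true;
}
```

With 512 threads and 1088 elements, threads 0 to 63 run the body three times and threads 64 to 511 only twice, so the barrier inside the loop is not reached by all threads. By the same arithmetic, 1088 is not a multiple of 128 or 256 either, so those block sizes presumably only appeared to work; of the sizes tried, only 64 divides 1088 evenly.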

In your case you can simply remove the __syncthreads() calls inside the while loops, since they serve no purpose there.
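A minimal sketch of what the kernel could look like with the barriers moved out of the loops is given below. It rests on several assumptions that are not in the question: the names `regroupRows`, `ROW_LEN`, and `countMismatches` are hypothetical, the kernel regroups the loaded input instead of writing the constants 123 and 321 (my guess at the unsimplified intent), and the host function is only a test harness with minimal error checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ROW_LEN 1088  // elements per row, as in the question

// Barriers sit outside the strided loops, so every thread of the block
// reaches each __syncthreads() exactly once, whatever the block size.
__global__ void regroupRows(float* globalArray, size_t pitch)
{
    __shared__ float sharedInputArray[ROW_LEN];
    __shared__ float sharedOutputArray[ROW_LEN];

    // Start of this block's row; pitch is in bytes, so offset through char*.
    float* row = (float*)((char*)globalArray + blockIdx.x * pitch);

    // Global -> shared.
    for (int i = threadIdx.x; i < ROW_LEN; i += blockDim.x)
        sharedInputArray[i] = row[i];
    __syncthreads();

    // Even positions go to the first half, odd positions to the second half.
    for (int i = threadIdx.x; i < ROW_LEN; i += blockDim.x) {
        int dst = (i % 2 == 0) ? i / 2 : i / 2 + ROW_LEN / 2;
        sharedOutputArray[dst] = sharedInputArray[i];
    }
    __syncthreads();  // the copy-out reads elements written by other threads

    // Shared -> global.
    for (int i = threadIdx.x; i < ROW_LEN; i += blockDim.x)
        row[i] = sharedOutputArray[i];
}

// Hypothetical harness: fills 1920 rows with 123 321 123 321 ..., runs the
// kernel, and returns the number of wrong elements (-1 on a CUDA error).
int countMismatches(int threadsPerBlock)
{
    const int rows = 1920;
    const size_t rowBytes = ROW_LEN * sizeof(float);
    std::vector<float> host((size_t)rows * ROW_LEN);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < ROW_LEN; ++c)
            host[(size_t)r * ROW_LEN + c] = (c % 2 == 0) ? 123.0f : 321.0f;

    float* dev = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&dev, &pitch, rowBytes, rows) != cudaSuccess)
        return -1;
    cudaMemcpy2D(dev, pitch, host.data(), rowBytes, rowBytes, rows,
                 cudaMemcpyHostToDevice);

    regroupRows<<<rows, threadsPerBlock>>>(dev, pitch);

    // This copy runs after the kernel on the default stream.
    cudaError_t err = cudaMemcpy2D(host.data(), rowBytes, dev, pitch, rowBytes,
                                   rows, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    int bad = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < ROW_LEN; ++c) {
            float expected = (c < ROW_LEN / 2) ? 123.0f : 321.0f;
            if (host[(size_t)r * ROW_LEN + c] != expected)
                ++bad;
        }
    return bad;
}
```

On a working device, `countMismatches(512)` is expected to return 0, and the same should hold for 64, 128, or 256 threads per block, since no barrier depends on how many loop iterations a given thread executes.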

Copying shared memory to global memory produces wrong, partially zero results
