Question

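I am working with a CUDA kernel that must operate on pointers to pointers. The kernel basically performs a large number of very small reductions, which are best done serially since each reduction is of size Nptrs = 3-4. Here are two implementations of the kernel: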

__global__
void kernel_RaiseIndexSLOW(double*__restrict__*__restrict__ A0gpu,
        const double*__restrict__*__restrict__ B0gpu,
        const double*__restrict__*__restrict__ C0gpu,
        const int Nptrs, const int Nx){
      const int i = blockIdx.y;
      const int j = blockIdx.z;
      const int idx = blockIdx.x*blockDim.x + threadIdx.x;
      if(i<Nptrs) {
         if(j<Nptrs) {
           for (int x = idx; x < Nx; x += blockDim.x*gridDim.x){
              A0gpu[i+3*j][x] = B0gpu[i][x]*C0gpu[3*j][x]
                       +B0gpu[i+3][x]*C0gpu[1+3*j][x]
                       +B0gpu[i+6][x]*C0gpu[2+3*j][x];               
           }
         }
       }
 }

__global__
void kernel_RaiseIndexsepderef(double*__restrict__*__restrict__  A0gpu, 
               const double*__restrict__*__restrict__ B0gpu,
               const double*__restrict__*__restrict__ C0gpu,
               const int Nptrs, const int Nx){
      const int i = blockIdx.y;
      const int j = blockIdx.z;
      const int idx = blockIdx.x*blockDim.x + threadIdx.x;
      if(i<Nptrs) {
         if(j<Nptrs) {
           double*__restrict__ A0ptr = A0gpu[i+3*j];
           const double*__restrict__ B0ptr0 = B0gpu[i];
           const double*__restrict__ C0ptr0 = C0gpu[3*j];
           const double*__restrict__ B0ptr1 = B0ptr0+3;
           const double*__restrict__ B0ptr2 = B0ptr0+6;
           const double*__restrict__ C0ptr1 = C0ptr0+1;
           const double*__restrict__ C0ptr2 = C0ptr0+2;

           for (int x = idx; x < Nx; x += blockDim.x*gridDim.x){
              double d2 = C0ptr0[x];
              double d4 = C0ptr1[x]; //FLAGGED
              double d6 = C0ptr2[x]; //FLAGGED
              double d1 = B0ptr0[x];
              double d3 = B0ptr1[x]; //FLAGGED
              double d5 = B0ptr2[x]; //FLAGGED
              A0ptr[x] = d1*d2 + d3*d4 + d5*d6;
           }
         }
       }
 }

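As the names suggest, the "sepderef" kernel runs about 40% faster than its counterpart. Once launch overhead is accounted for, it achieves an effective bandwidth of about 85 GB/s at Nptrs = 3 and Nx = 60000 on an M2090 with ECC on (~160 GB/s would be optimal).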

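Running these through nvvp shows that the kernel is bandwidth bound. Strangely, however, the lines I have marked //FLAGGED are highlighted by the profiler as areas of sub-optimal memory access. I don't understand why, since the access here looks coalesced to me. Why isn't it?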

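Edit: I forgot to point this out, but note that the //FLAGGED lines access pointers that I have done arithmetic on, whereas the other lines access pointers obtained with the square-bracket operator.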

Answer 1

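To understand this behaviour, you need to be aware that all CUDA GPUs so far execute instructions in order. After an instruction that loads an operand from memory has been issued, other independent instructions continue to execute. However, once an instruction is encountered that depends on that operand, all further operations on that instruction stream stall until the operand becomes available.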

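In your "sepderef" example, you load all operands from memory before summing them, which means the global memory latency is potentially incurred only once per loop iteration. There are six loads per iteration, but they can all overlap. Only the first addition in the loop stalls until its operands are available; after that stall, every other addition has its operands available immediately or very soon.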

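In the "SLOW" example, the loads from memory and the additions are intermixed, so the global memory latency is incurred multiple times per loop iteration.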

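You may wonder why the compiler does not automatically move the load instructions ahead of the computation. The CUDA compilers used to do this very aggressively, spending additional registers to hold the operands until they are used. CUDA 8.0, however, seems far less aggressive in this respect and sticks much more closely to the order of instructions in the source code. This gives the programmer a better opportunity to structure the code for performance in cases where the compiler's instruction scheduling was suboptimal. At the same time, it also puts more burden on the programmer to schedule instructions explicitly, even in cases where previous compiler versions got it right.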
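As a rough illustration of the load-first ordering described in this answer, the sketch below rewrites the "SLOW" kernel so that all six loads are issued before any arithmetic depends on them. It is only a sketch under stated assumptions: the names `kernel_RaiseIndexLoadFirst` and `max_abs_error_load_first` are hypothetical, the stride of 3 is hard-coded as in the question (so Nptrs = 3 is assumed), the `__restrict__` qualifiers are dropped for brevity, and CUDA error checking is omitted. Whether the source ordering survives into the generated code depends on the compiler version, so it is worth confirming in the profiler.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical load-first variant of the "SLOW" kernel from the question.
// Row offsets are applied to the pointer tables (B0gpu[i + 3]), exactly as
// in the SLOW kernel, so every row pointer comes from a table lookup.
__global__ void kernel_RaiseIndexLoadFirst(double* const* A0gpu,
                                           const double* const* B0gpu,
                                           const double* const* C0gpu,
                                           int Nptrs, int Nx)
{
    const int i = blockIdx.y;
    const int j = blockIdx.z;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Nptrs || j >= Nptrs) return;

    // Dereference the pointer tables once, outside the loop.
    double* a = A0gpu[i + 3 * j];
    const double* b0 = B0gpu[i];
    const double* b1 = B0gpu[i + 3];
    const double* b2 = B0gpu[i + 6];
    const double* c0 = C0gpu[3 * j];
    const double* c1 = C0gpu[1 + 3 * j];
    const double* c2 = C0gpu[2 + 3 * j];

    for (int x = idx; x < Nx; x += blockDim.x * gridDim.x) {
        // Issue all six independent loads first ...
        const double d1 = b0[x], d3 = b1[x], d5 = b2[x];
        const double d2 = c0[x], d4 = c1[x], d6 = c2[x];
        // ... and only then use them.
        a[x] = d1 * d2 + d3 * d4 + d5 * d6;
    }
}

// Hypothetical host helper: runs the kernel for Nptrs = 3 and returns the
// largest absolute difference from a CPU reference. Error checks omitted.
double max_abs_error_load_first(int Nx)
{
    const int N = 3, M = N * N;
    const size_t bytes = sizeof(double) * M * Nx;
    std::vector<double> hA(M * Nx), hB(M * Nx), hC(M * Nx);
    for (int k = 0; k < M; ++k)
        for (int x = 0; x < Nx; ++x) {
            hB[k * Nx + x] = double(k + x % 7);      // small integers, so the
            hC[k * Nx + x] = double(2 * k + x % 5);  // results are exact
        }

    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    // Build the pointer tables on the host, then copy them to the device.
    std::vector<double*> tA(M), tB(M), tC(M);
    for (int k = 0; k < M; ++k) {
        tA[k] = dA + k * Nx; tB[k] = dB + k * Nx; tC[k] = dC + k * Nx;
    }
    double **dTA, **dTB, **dTC;
    const size_t tbytes = sizeof(double*) * M;
    cudaMalloc(&dTA, tbytes); cudaMalloc(&dTB, tbytes); cudaMalloc(&dTC, tbytes);
    cudaMemcpy(dTA, tA.data(), tbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dTB, tB.data(), tbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dTC, tC.data(), tbytes, cudaMemcpyHostToDevice);

    const dim3 block(256), grid((Nx + 255) / 256, N, N);
    kernel_RaiseIndexLoadFirst<<<grid, block>>>(dTA, dTB, dTC, N, Nx);
    cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);  // also syncs

    double err = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int x = 0; x < Nx; ++x) {
                const double ref = hB[i * Nx + x]       * hC[(3 * j) * Nx + x]
                                 + hB[(i + 3) * Nx + x] * hC[(1 + 3 * j) * Nx + x]
                                 + hB[(i + 6) * Nx + x] * hC[(2 + 3 * j) * Nx + x];
                err = fmax(err, fabs(ref - hA[(i + 3 * j) * Nx + x]));
            }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(dTA); cudaFree(dTB); cudaFree(dTC);
    return err;
}
```

Calling `max_abs_error_load_first(60000)` from a `main` compiled with `nvcc` exercises the problem size from the question. Timing this kernel against the two originals would show whether the explicit ordering makes a difference on a given toolkit.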
