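How to get rid of a sequential part in CUDA

Date: 2012-10-02 15:40:10

Tags: cuda gpu gpgpu nvidia

My kernel has a sequential part that really slows it down, but I can't see how to get rid of the inner loop. Any suggestions?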

__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
 int* d_Xco, bool* d_Xvalid,int* d_A )
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;

  int k1;

  if( i < keep && j <= i){
    int counter = 0;

    for(k1 = 0; k1 < inc; k1++){
      if(d_Xvalid[j*inc + k1] == 0)
         counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
    }

    d_A[i*keep+j] = inc - d_Xnum[i] - counter;
  }
}

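I believe getting rid of the k1 loop would speed up my code, but I don't see how to do it because of counter. Any suggestions or ideas are welcome! The kernel is called like this: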

         ...
  int t = 32;
  int b = keep/(32)+1;
  int b2 = (inc/32)+1;
  dim3 thread (t, t);
  dim3 block (b, inc);

  // kernel call
  myKernel<<<block, thread>>>(k, inc, width, d_Xnum,
                  d_Xco, d_Xvalid, d_A);
  cudaThreadSynchronize();
            ...

keep is about 9000 and inc is about 20000.

1 answer:

Answer 0 (score: 2)

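This is not an exact answer to your question, but it may help you optimize your code, and it may make a parallel reduction of the k1 sum easier to implement, because it gets rid of the if( i < keep && j <= i). Depending on your GPU model you can also apply other optimizations, such as reading these read-only vectors through textures.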
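The parallel reduction mentioned above can be pictured with a small host-side sketch. This is not code from the question or the answer: `block_reduce_sum`, `terms`, and `num_threads` are hypothetical names, and the function only emulates, sequentially in plain C++, what one thread block would do for a single (i, j) pair. Here `terms[k1]` stands for the value that iteration k1 of the inner loop would add to counter (zero when the d_Xvalid test fails).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum `terms` the way a block of `num_threads` threads would:
// strided partial sums first, then a pairwise tree reduction.
// num_threads must be a power of two.
int block_reduce_sum(const std::vector<int>& terms, std::size_t num_threads) {
    assert(num_threads > 0 && (num_threads & (num_threads - 1)) == 0);
    std::vector<int> partial(num_threads, 0);  // would be __shared__ memory in a kernel

    // Phase 1: "thread" t accumulates terms t, t + num_threads, t + 2*num_threads, ...
    for (std::size_t t = 0; t < num_threads; ++t)
        for (std::size_t k1 = t; k1 < terms.size(); k1 += num_threads)
            partial[t] += terms[k1];

    // Phase 2: halve the active range each step.
    // A kernel would call __syncthreads() between steps.
    for (std::size_t stride = num_threads / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            partial[t] += partial[t + stride];

    return partial[0];  // the value the sequential loop would have left in counter
}
```

In a real kernel the two outer `t` loops disappear, since each thread runs its own iteration. Whether this pays off depends on how many (i, j) pairs can be processed concurrently, so treat it as one option to measure rather than a guaranteed speed-up.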

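Because of the way you generate the indices, many threads sit idle waiting for the others to finish. You are launching keep*inc threads, but at most keep*(keep+1)/2 of them do any work (because of the condition j <= i).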
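To put numbers on that, here is a minimal sketch of the arithmetic, using the approximate sizes from the question (keep = 9000, inc = 20000). The helper names `useful_threads` and `useful_fraction` are made up for illustration, and the fraction assumes the launch really covers keep*inc threads, as stated above.

```cpp
#include <cassert>
#include <cstdint>

// Number of (i, j) pairs with 0 <= j <= i < keep, i.e. threads that do real work.
std::int64_t useful_threads(std::int64_t keep) {
    assert(keep >= 0);
    return keep * (keep + 1) / 2;
}

// Fraction of a keep x inc launch that passes the j <= i test (assumes inc >= keep).
double useful_fraction(std::int64_t keep, std::int64_t inc) {
    assert(keep > 0 && inc >= keep);
    return static_cast<double>(useful_threads(keep)) /
           static_cast<double>(keep * inc);
}
```

For keep = 9000 and inc = 20000, `useful_threads` gives 40,504,500 out of 180,000,000 launched threads, a fraction of about 0.225, so roughly three out of four threads exit without doing anything.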

I think you can improve it with the following changes:

  1. Launch keep*(keep+1)/2 threads

  2. Change your code as follows

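    __global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
    int* d_Xco, bool* d_Xvalid,int* d_A )
    {
      // One thread per (i, j) pair of the lower triangle, addressed by a linear index k
      int k = blockIdx.x * blockDim.x + threadIdx.x;
      int i = (int)(sqrt(0.25+2.0*k)-0.5);  // row: largest i with i*(i+1)/2 <= k
      int j = k - i*(i+1)/2;                // column: offset within row i, so 0 <= j <= i
    
      // j <= i holds by construction, so only the upper bound on i needs checking
      if( i < keep ){
        int counter = 0;
    
        for(int k1 = 0; k1 < inc; k1++){
          if(d_Xvalid[j*inc + k1] == 0)
             counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
        }
    
        d_A[i*keep+j] = inc - d_Xnum[i] - counter;
      }
    }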
    
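  3. What you are doing now (for keep = 4, launching 4*4 = 16 threads in the best case; if inc > keep, as seems to be the case, you launch even more threads) can be pictured like this (each 'box' is a thread)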

    _________________________________
    | i = 0 | i = 0 | i = 0 | i = 0 |
    | j = 0 |   -   |   -   |   -   |
    _________________________________
    | i = 1 | i = 1 | i = 1 | i = 1 |
    | j = 0 | j = 1 |   -   |   -   |
    _________________________________
    | i = 2 | i = 2 | i = 2 | i = 2 |
    | j = 0 | j = 1 | j = 2 |   -   |
    _________________________________
    | i = 3 | i = 3 | i = 3 | i = 3 |
    | j = 0 | j = 1 | j = 2 | j = 3 |
    _________________________________
    

    What I suggest instead is to add a linear index k and derive i and j from it as needed (for keep = 4, launching 4*(4+1)/2 = 10 threads)

    _________________________________________________________________________________
    | k = 0 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 | k = 8 | k = 9 |
    | i = 0 | i = 1 | i = 1 | i = 2 | i = 2 | i = 2 | i = 3 | i = 3 | i = 3 | i = 3 |
    | j = 0 | j = 0 | j = 1 | j = 0 | j = 1 | j = 2 | j = 0 | j = 1 | j = 2 | j = 3 |
    _________________________________________________________________________________
    

    This can be done with:

    • i = (int)(sqrt(0.25+2*k)-0.5)

    • j = k - i*(i+1)/2

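    You can take this as a recipe, or look at the math behind it.

    To find i, note that when j = 0, k = i*(i+1)/2 (because k = 0+1+2+...+i = i*(i+1)/2). Solving this equation for i, that is, taking the positive root of i*i + i - 2*k = 0, gives the expression for i above; the round-down of the (int) cast ensures you also get the right result when j != 0. To get j, subtract from k the value k has when j is 0, which is i*(i+1)/2.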
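The recipe can be checked on the host before it goes into a kernel. The sketch below is plain C++ with hypothetical names (`Pair`, `triangular_index`, `blocks_needed`); the closed form is the one from the answer, while the two correction loops are an added guard against floating-point rounding that the answer does not include. `blocks_needed` shows the usual round-up division for sizing the 1D launch from step 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Pair { std::int64_t i, j; };

// Map a linear index k >= 0 to the (i, j) cell of the lower triangle, 0 <= j <= i.
Pair triangular_index(std::int64_t k) {
    assert(k >= 0);
    // Closed form from the answer: positive root of i*i + i - 2*k = 0, rounded down.
    std::int64_t i = static_cast<std::int64_t>(
        std::sqrt(0.25 + 2.0 * static_cast<double>(k)) - 0.5);
    // Integer guard (an addition, not in the answer): fix i if rounding put it one off.
    while (i * (i + 1) / 2 > k) --i;
    while ((i + 1) * (i + 2) / 2 <= k) ++i;
    return {i, k - i * (i + 1) / 2};
}

// Blocks needed for a 1D launch that covers keep*(keep+1)/2 threads.
std::int64_t blocks_needed(std::int64_t keep, std::int64_t threads_per_block) {
    assert(keep >= 0 && threads_per_block > 0);
    const std::int64_t total = keep * (keep + 1) / 2;
    return (total + threads_per_block - 1) / threads_per_block;  // round up
}
```

For keep = 9000 and an assumed block size of 256 threads, `blocks_needed` returns 158,221. That is more than the 65,535 blocks a grid may have in one dimension on devices of compute capability below 3.0, so on such hardware the linear index would have to be spread over a 2D grid, or each thread would have to process several values of k.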