CUDA - Optimizing a surface detection kernel

Time: 2015-02-13 22:01:18

Tags: c++ cuda

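I am trying to optimize my surface detection kernel: given a binary input image of 512 (w) x 1024 (h) pixels, I want to find the first bright surface in the image. The code I wrote launches 512 threads, one per column, and each thread searches down its column for the first pixel whose 3x3 neighborhood is entirely bright. The code works, but at ~9.46 ms it is a bit slow, and I would like to make it run faster.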

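Edit 1: The run time dropped to less than half of what the original kernel took. Robert's kernel runs in 4.032 ms on my Quadro K6000.

Edit 2: I managed to improve performance further by halving the number of threads. My kernel (as modified by Robert) now runs in 2.125 ms on my Quadro K6000.

The kernel is invoked as follows: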

firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);

I would like to use shared memory to improve the memory fetches; any ideas on how to optimize this code?

__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

int col = threadIdx.x + (blockDim.x*blockIdx.x); 
int rows2skip = 10; 
float thresh = 1.0f;

 //thread Index: (0 -> 511)

if (col < width) {

    if( col == 0 ) { // first col - 0
        for (int row = 0 + rows2skip; row < height - 2; row++) { // skip the first rows2skip (10) rows
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the right 
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col+1)];           if(neibs[1] == thresh) { cnt++; }   // right
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col+1)];         if(neibs[3] == thresh) { cnt++; }   // bottom right
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom right

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }
    }

    else if ( col == (width-1) ) { // last col 
        for (int row = 0 + rows2skip; row < height -2; row++) { 
            int cnt = 0;
             float neibs[6]; // not shared mem as it reduces speed  

            // get six neighbours - three in same col, and three to the left
            neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
            neibs[1] = threshImg[((row)*width) +(col-1)];           if(neibs[1] == thresh) { cnt++; }   // left
            neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }   // bottom left
            neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
            neibs[5] = threshImg[((row+2)*width) +(col-1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom left

            if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
            }
        }       
    }

    // remaining threads are: (1 -> 510) 

    else { // any col other than first or last column
        for (int row = 0 + rows2skip; row < height - 2; row++) { 

            int cnt = 0;
            float neibs[9]; // not shared mem as it reduces speed   

            // for threads < width/4, get the neighbors
            // get nine neighbours - three in curr col, three each to left and right
            neibs[0] = threshImg[((row)*width) +(col-1)];           if(neibs[0] == thresh) { cnt++; } 
            neibs[1] = threshImg[((row)*width) +(col)];             if(neibs[1] == thresh) { cnt++; } 
            neibs[2] = threshImg[((row)*width) +(col+1)];           if(neibs[2] == thresh) { cnt++; }           
            neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }           
            neibs[4] = threshImg[((row+1)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }           
            neibs[5] = threshImg[((row+1)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }           
            neibs[6] = threshImg[((row+2)*width) +(col-1)];         if(neibs[6] == thresh) { cnt++; }           
            neibs[7] = threshImg[((row+2)*width) +(col)];           if(neibs[7] == thresh) { cnt++; }           
            neibs[8] = threshImg[((row+2)*width) +(col+1)];         if(neibs[8] == thresh) { cnt++; }

            if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary

                firstSurfaceImg[(row)*width + col] = 1.0f;
                row = height;
                }
            }
        }       
    }           

__syncthreads();
}
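
For checking any faster variant against the intended behavior, the per-column search can also be written as a plain sequential C++ function. This is a hypothetical host-side sketch, not part of the kernel above: the name `firstSurfaceReference` and the `-1` "not found" convention are assumptions made for illustration.

```cpp
#include <vector>

// Hypothetical host-side reference (illustration only).
// For each column, returns the first row >= rowsToSkip whose neighbourhood
// (rows row..row+2, columns col-1..col+1 clipped to the image) is entirely
// equal to thresh, or -1 if the column has no such row.
std::vector<int> firstSurfaceReference(const std::vector<float>& img,
                                       int height, int width,
                                       int rowsToSkip = 10,
                                       float thresh = 1.0f) {
    std::vector<int> firstRow(width, -1);
    for (int col = 0; col < width; ++col) {
        // Edge columns only have a neighbour on one side.
        const int cLo = (col == 0) ? col : col - 1;
        const int cHi = (col == width - 1) ? col : col + 1;
        // Stop two rows early because rows row+1 and row+2 are read.
        for (int row = rowsToSkip; row < height - 2 && firstRow[col] < 0; ++row) {
            bool allBright = true;
            for (int r = row; r <= row + 2 && allBright; ++r)
                for (int c = cLo; c <= cHi; ++c)
                    if (img[r * width + c] != thresh) { allBright = false; break; }
            if (allBright) firstRow[col] = row;
        }
    }
    return firstRow;
}
```

The kernel marks the found pixel in an output image instead of returning a row index, so a comparison would check that `firstSurfaceImg[firstRow[col] * width + col]` is `1.0f` for every column with a non-negative entry.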

1 Answer:

Answer 0 (score: 1)

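Here is a worked example that demonstrates 2 of the 3 concepts discussed in the comments:

  1. The first optimization to consider is that 512 threads are not enough to keep any GPU busy. We would like to target 10000 threads or more. The GPU is a latency-hiding machine, and when you have too few threads to help the GPU hide latency, your kernel becomes latency-bound, which is a kind of memory-bound problem. The most straightforward way to fix this is to have each thread process one pixel of the image (allowing 512*1024 threads in total) rather than one column (allowing only 512 threads in total). However, since this seems to "break" our "first-surface detection" algorithm, we must make another modification as well (2).

  2. Once all pixels are processed in parallel, a straightforward adaptation of item 1 above means we no longer know which surface was "first", i.e. which "bright" surface (per column) is closest to row 0. This characteristic of the algorithm changes the problem from a simple transformation into a reduction (one reduction per column of the image, actually). We still process every column in parallel, with 1 thread assigned to each pixel, but for each column we keep the pixel closest to row 0 that satisfies the brightness test. A relatively simple way to do this is to use atomicMin on a one-entry-per-column array holding the minimum row (in each column) at which a suitably bright pixel neighborhood was found.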

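  The following code demonstrates the above 2 changes (it does not use shared memory) and shows (for me, on a Tesla K40) a speedup in the range of about 1x-20x versus the OP's original kernel. The range arises because the amount of work each algorithm does varies with image structure. Both algorithms have early-exit strategies. Because of the early-exit structure of its for-loops, the original kernel can do vastly more or less work depending on where (if anywhere) in each column a "bright" pixel neighborhood is found. So if all columns have bright neighborhoods near row 0, I see an "improvement" of about 1x (i.e. my kernel runs at about the same speed as the original). If all columns have bright neighborhoods (only) near the other "end" of the image, I see an improvement of about 20x. This may well vary by GPU, since Kepler GPUs have improved global atomic throughput, which I am relying on. EDIT: because of this variable work, I have added a crude "early-exit" strategy as a trivial modification to my code. This brings the shortest execution time to approximate parity between the two kernels (i.e. about 1x).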
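
The combination of a per-pixel test and a per-column atomic minimum can be modeled on the host. What follows is a minimal sequential C++ sketch under stated assumptions, not the CUDA kernel: `atomicMinHost` is a hypothetical stand-in for `atomicMin` built on `std::atomic`, and the names `neighbourhoodBright` and `firstSurfacePerPixel` are illustrative. Pixels are visited bottom-up on purpose, to show that the result does not depend on the order in which work items run.

```cpp
#include <atomic>
#include <climits>
#include <vector>

// Host-side stand-in for CUDA's atomicMin(int*, int): lowers target to value
// if value is smaller, using a compare-and-swap loop.
inline void atomicMinHost(std::atomic<int>& target, int value) {
    int current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {
        // A failed exchange reloads `current`; retry while value is still smaller.
    }
}

// True if rows row..row+2 and columns col-1..col+1 (clipped at the image
// edges) all equal thresh. The caller guarantees row + 2 < height.
inline bool neighbourhoodBright(const std::vector<float>& img, int width,
                                int row, int col, float thresh) {
    const int cLo = (col == 0) ? col : col - 1;
    const int cHi = (col == width - 1) ? col : col + 1;
    for (int r = row; r <= row + 2; ++r)
        for (int c = cLo; c <= cHi; ++c)
            if (img[r * width + c] != thresh) return false;
    return true;
}

// One "work item" per pixel. Entries left at INT_MAX mean that no surface
// was found in that column.
std::vector<int> firstSurfacePerPixel(const std::vector<float>& img,
                                      int height, int width,
                                      int rowsToSkip = 10, float thresh = 1.0f) {
    std::vector<std::atomic<int>> minRow(width);
    for (auto& m : minRow) m.store(INT_MAX);

    for (int row = height - 3; row >= rowsToSkip; --row)
        for (int col = 0; col < width; ++col) {
            if (row >= minRow[col].load()) continue;  // crude early exit
            if (neighbourhoodBright(img, width, row, col, thresh))
                atomicMinHost(minRow[col], row);
        }

    std::vector<int> result(width);
    for (int c = 0; c < width; ++c) result[c] = minRow[c].load();
    return result;
}
```

Because every update goes through the minimum operation, a column with two bright bands still ends up holding the row of the upper band, whichever band is tested first.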

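    Remaining optimizations might include:

    1. Use of shared memory. This should be a trivial adaptation of the same tile-based shared memory approach that is used, for example, for matrix multiplication. If you use a square-ish tile, you will want to adjust the kernel block/grid dimensions to make them "square-ish" as well.

    2. An improved reduction technique. For some image structures, the atomic method may be somewhat slow. This could possibly be improved by switching to a proper parallel reduction per column. You can run a "worst-case" test on my kernel by setting the test image to the threshold value everywhere. This should cause the original kernel to run fastest and my kernel to run slowest, but I did not observe any significant slowdown of my kernel in this case. The execution time of my kernel is pretty constant. Again, this may be GPU-dependent.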
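
As a rough illustration of what a "proper parallel reduction per column" computes, here is a host-side C++ sketch of a tree-style minimum reduction over one column's candidate rows. It is an assumption-laden model, not kernel code: the name `reduceMinTree` and the convention that a failing row contributes `INT_MAX` are hypothetical. The inner loop is the part that a thread block would execute in parallel, typically on a shared-memory array.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// Tree-style min reduction over one column's candidate rows.
// candidates[i] holds i if row i passed the brightness test, else INT_MAX.
// Each pass halves the active range; on a GPU, the `stride` iterations of
// the inner loop would run as `stride` concurrent threads.
int reduceMinTree(std::vector<int> candidates) {
    if (candidates.empty()) return INT_MAX;

    // Pad to a power of two with INT_MAX, the identity element of min.
    std::size_t n = 1;
    while (n < candidates.size()) n <<= 1;
    candidates.resize(n, INT_MAX);

    for (std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for (std::size_t i = 0; i < stride; ++i)
            candidates[i] = std::min(candidates[i], candidates[i + stride]);

    return candidates[0];  // smallest passing row, or INT_MAX if none passed
}
```

Unlike the atomic approach, the number of steps here is fixed by the column height (about log2 of it), regardless of how many rows pass the test.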

    Example:

      #include <stdlib.h>
      #include <stdio.h>
      
      #define SKIP_ROWS 10
      #define THRESH 1.0f
      
      #include <time.h>
      #include <sys/time.h>
      #define USECPSEC 1000000ULL
      
      unsigned long long dtime_usec(unsigned long long start){
      
        timeval tv;
        gettimeofday(&tv, 0);
        return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
      }
      
      __global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {
      
      int col = threadIdx.x + (blockDim.x*blockIdx.x); 
      int rows2skip = SKIP_ROWS; 
      float thresh = THRESH;
      
       //thread Index: (0 -> 511)
      
      if (col < width) {
      
          if( col == 0 ) { // first col - 0
              for (int row = 0 + rows2skip; row < height - 2; row++) { // skip the first rows2skip rows; rows row+1 and row+2 are read below
                  int cnt = 0;
                   float neibs[6]; // not shared mem as it reduces speed  
      
                  // get six neighbours - three in same col, and three to the right 
                  neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
                  neibs[1] = threshImg[((row)*width) +(col+1)];           if(neibs[1] == thresh) { cnt++; }   // right
                  neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
                  neibs[3] = threshImg[((row+1)*width) +(col+1)];         if(neibs[3] == thresh) { cnt++; }   // bottom right
                  neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
                  neibs[5] = threshImg[((row+2)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom right
      
                  if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                      firstSurfaceImg[(row)*width + col] = 1.0f;
                      row = height;
                  }
              }
          }
      
          else if ( col == (width-1) ) { // last col 
              for (int row = 0 + rows2skip; row < height - 2; row++) { // stop 2 rows early: rows row+1 and row+2 are read below
                  int cnt = 0;
                   float neibs[6]; // not shared mem as it reduces speed  
      
                  // get six neighbours - three in same col, and three to the left
                  neibs[0] = threshImg[((row)*width) +(col)];             if(neibs[0] == thresh) { cnt++; }   // current position
                  neibs[1] = threshImg[((row)*width) +(col-1)];           if(neibs[1] == thresh) { cnt++; }   // left
                  neibs[2] = threshImg[((row+1)*width) +(col)];           if(neibs[2] == thresh) { cnt++; }   // bottom
                  neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }   // bottom left
                  neibs[4] = threshImg[((row+2)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }   // curr offset by 2 - bottom
                  neibs[5] = threshImg[((row+2)*width) +(col-1)];         if(neibs[5] == thresh) { cnt++; }   // curr offset by 2 - bottom left
      
                  if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
                      firstSurfaceImg[(row)*width + col] = 1.0f;
                      row = height;
                  }
              }       
          }
      
          // remaining threads are: (1 -> 510) 
      
          else { // any col other than first or last column
              for (int row = 0 + rows2skip; row < height - 2; row++) { // stop 2 rows early: rows row+1 and row+2 are read below
      
                  int cnt = 0;
                  float neibs[9]; // not shared mem as it reduces speed   
      
                  // for threads < width/4, get the neighbors
                  // get nine neighbours - three in curr col, three each to left and right
                  neibs[0] = threshImg[((row)*width) +(col-1)];           if(neibs[0] == thresh) { cnt++; } 
                  neibs[1] = threshImg[((row)*width) +(col)];             if(neibs[1] == thresh) { cnt++; } 
                  neibs[2] = threshImg[((row)*width) +(col+1)];           if(neibs[2] == thresh) { cnt++; }           
                  neibs[3] = threshImg[((row+1)*width) +(col-1)];         if(neibs[3] == thresh) { cnt++; }           
                  neibs[4] = threshImg[((row+1)*width) +(col)];           if(neibs[4] == thresh) { cnt++; }           
                  neibs[5] = threshImg[((row+1)*width) +(col+1)];         if(neibs[5] == thresh) { cnt++; }           
                  neibs[6] = threshImg[((row+2)*width) +(col-1)];         if(neibs[6] == thresh) { cnt++; }           
                  neibs[7] = threshImg[((row+2)*width) +(col)];           if(neibs[7] == thresh) { cnt++; }           
                  neibs[8] = threshImg[((row+2)*width) +(col+1)];         if(neibs[8] == thresh) { cnt++; }
      
                  if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
      
                      firstSurfaceImg[(row)*width + col] = 1.0f;
                      row = height;
                      }
                  }
              }       
          }           
      
      __syncthreads();
      }
      
      __global__ void firstSurfaceDetection_opt (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {
      
        int col = threadIdx.x + (blockDim.x*blockIdx.x); 
        int row = threadIdx.y + (blockDim.y*blockIdx.y);
      
        int rows2skip = SKIP_ROWS; 
        float thresh = THRESH;
      
        if ((row >= rows2skip) && (row < height-2) && (col < width) && (row < firstSurfaceImgRow[col])) {
      
          int cnt = 0;
          int inc = 0;
          if (col == 0) inc = +1;
          if (col == (width-1)) inc = -1;
          if (inc){
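                  cnt = 3; // edge columns have only 6 neighbours; pre-credit the 3 missing ones so the cnt == 9 test still applies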
                  if (threshImg[((row)*width)   +(col)]     == thresh) cnt++;
                  if (threshImg[((row)*width)   +(col+inc)] == thresh) cnt++;
                  if (threshImg[((row+1)*width) +(col)]     == thresh) cnt++;   
                  if (threshImg[((row+1)*width) +(col+inc)] == thresh) cnt++;      
                  if (threshImg[((row+2)*width) +(col)]     == thresh) cnt++;     
                  if (threshImg[((row+2)*width) +(col+inc)] == thresh) cnt++;
                  }
          else {
                  // get nine neighbours - three in curr col, three each to left and right
                  if (threshImg[((row)*width)   +(col-1)] == thresh) cnt++;
                  if (threshImg[((row)*width)   +(col)]   == thresh) cnt++;
                  if (threshImg[((row)*width)   +(col+1)] == thresh) cnt++;
                  if (threshImg[((row+1)*width) +(col-1)] == thresh) cnt++;
                  if (threshImg[((row+1)*width) +(col)]   == thresh) cnt++;   
                  if (threshImg[((row+1)*width) +(col+1)] == thresh) cnt++;      
                  if (threshImg[((row+2)*width) +(col-1)] == thresh) cnt++;
                  if (threshImg[((row+2)*width) +(col)]   == thresh) cnt++;     
                  if (threshImg[((row+2)*width) +(col+1)] == thresh) cnt++;
                  }
          if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
                  atomicMin(firstSurfaceImgRow + col, row);
                  }
          }
      }
      
      
      int main(int argc, char *argv[]){
      
        float *threshImg, *h_threshImg, *firstSurfaceImg, *h_firstSurfaceImg;
        int *firstSurfaceImgRow, *h_firstSurfaceImgRow;
        int actualImHeight = 1024;
        int actualImWidth = 512;
        int row_set = 512;
        if (argc > 1){
          int my_val = atoi(argv[1]);
          if ((my_val > SKIP_ROWS) && (my_val < actualImHeight - 3)) row_set = my_val;
          }
      
        h_firstSurfaceImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
        h_threshImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
        h_firstSurfaceImgRow = (int *)malloc(actualImWidth*sizeof(int));
        cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float));
        cudaMalloc(&firstSurfaceImg, actualImHeight*actualImWidth*sizeof(float));
        cudaMalloc(&firstSurfaceImgRow, actualImWidth*sizeof(int));
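        // cudaMemset fills bytes: with a 4-byte int every entry becomes 0x01010101 (16843009), a "not found" value above any row index
        cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));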
        cudaMemset(firstSurfaceImg, 0, actualImHeight*actualImWidth*sizeof(float));
      
        for (int i = 0; i < actualImHeight*actualImWidth; i++) h_threshImg[i] = 0.0f;
        // insert "bright row" here
        for (int i = (row_set*actualImWidth); i < ((row_set+3)*actualImWidth); i++) h_threshImg[i] = THRESH;
      
        cudaMemcpy(threshImg, h_threshImg, actualImHeight*actualImWidth*sizeof(float), cudaMemcpyHostToDevice);
      
      
        dim3 grid(1,1024);
        //warm-up run
        firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
        cudaDeviceSynchronize();
        cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
        cudaDeviceSynchronize();
        unsigned long long t2 = dtime_usec(0);
        firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
        cudaDeviceSynchronize();
        t2 = dtime_usec(t2);
        cudaMemcpy(h_firstSurfaceImgRow, firstSurfaceImgRow, actualImWidth*sizeof(int), cudaMemcpyDeviceToHost);
        unsigned long long t1 = dtime_usec(0);
        firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
        cudaDeviceSynchronize();
        t1 = dtime_usec(t1);
        cudaMemcpy(h_firstSurfaceImg, firstSurfaceImg, actualImWidth*actualImHeight*sizeof(float), cudaMemcpyDeviceToHost); 
      
        printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);
        // validate results
        for (int i = 0; i < actualImWidth; i++) 
          if (h_firstSurfaceImgRow[i] < actualImHeight) 
            if (h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i] != THRESH)
              {printf("mismatch at %d, was %f, should be %f\n", i, h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i], THRESH); return 1;}
        return 0;
      }
      $ nvcc -arch=sm_35 -o t667 t667.cu
      $ ./t667
      t1 = 0.000978s, t2 = 0.000050s
      $
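
One detail in the listing above is easy to misread: `cudaMemset(firstSurfaceImgRow, 1, ...)` does not set each `int` to 1. `cudaMemset` fills bytes, so with a 4-byte `int` every element becomes 0x01010101 (16843009), which the code then treats as a "no surface found" value that exceeds any valid row index. The host-side C++ sketch below reproduces that byte pattern with `std::memset`; the helper name `byteFilledInts` is hypothetical, and a 32-bit `int` is assumed.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Fill an int buffer byte by byte, the way memset (and cudaMemset) does,
// and return it. Host-side illustration only.
std::vector<int> byteFilledInts(std::size_t count, int byteValue) {
    std::vector<int> v(count);
    if (!v.empty())
        std::memset(v.data(), byteValue, count * sizeof(int));
    return v;
}
```

With a 32-bit `int`, `byteFilledInts(512, 1)[0]` is 16843009, which is why the validation loop in `main` can test `h_firstSurfaceImgRow[i] < actualImHeight` to skip columns where nothing was found.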
      

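      Notes:

      1. The above example inserts a "bright neighborhood" all the way across the image at row = 512, which gives a middle-of-the-road speedup factor of almost 20x in my case (K40c).

      2. For the sake of brevity, I have dispensed with proper CUDA error checking. I recommend it, however.

      3. The execution performance of either kernel depends quite a bit on whether it is the first one run. This probably has to do with caching and general warm-up effects. Therefore, to give sane results, I run my kernel first as an extra, untimed warm-up run.

      4. One reason I have not pursued a shared-memory optimization is that this problem is already pretty small, and at least on a big GPU like the K40 it will fit almost entirely in the L2 cache (especially with my kernel, since it uses a smaller output data structure). Given that, shared memory may not give much of a performance boost.

      5. EDIT: I have modified the code (again) so that the row of the test image where the bright boundary is inserted can be passed as a command-line parameter, and I have tested the code on 3 different devices, using 3 different settings for the bright row: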

        execution time on:     K40    C2075    Quadro NVS 310
        bright row =   15:   31/33    23/45       29/314
        bright row =  512:  978/50  929/112     1232/805
        bright row = 1000: 2031/71 2033/149    2418/1285
        
        all times are microseconds (original kernel/optimized kernel)
        CUDA 6.5, CentOS 6.2
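
The speedup implied by each table entry is simply the original time divided by the optimized time. The C++ sketch below just encodes the table and that division; the struct and function names (`Timing`, `kTimings`, `speedup`) are hypothetical and only serve the illustration.

```cpp
#include <cstddef>

// One entry of the timing table above: microseconds for the original and the
// optimized kernel on a given device, with the bright rows starting at brightRow.
struct Timing {
    const char* device;
    int brightRow;
    double originalUs;
    double optimizedUs;
};

// Values copied from the table above.
static const Timing kTimings[] = {
    {"K40",              15,   31.0,   33.0},
    {"K40",             512,  978.0,   50.0},
    {"K40",            1000, 2031.0,   71.0},
    {"C2075",            15,   23.0,   45.0},
    {"C2075",           512,  929.0,  112.0},
    {"C2075",          1000, 2033.0,  149.0},
    {"Quadro NVS 310",   15,   29.0,  314.0},
    {"Quadro NVS 310",  512, 1232.0,  805.0},
    {"Quadro NVS 310", 1000, 2418.0, 1285.0},
};
static const std::size_t kTimingCount = sizeof(kTimings) / sizeof(kTimings[0]);

// Speedup of the optimized kernel over the original (> 1 means faster).
inline double speedup(const Timing& t) {
    return t.originalUs / t.optimizedUs;
}
```

For example, the K40 entry for bright row 512 gives 978 / 50 = 19.56, while the Quadro NVS 310 entry for bright row 15 gives 29 / 314, roughly 0.09, meaning the optimized kernel is slower in that case.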