Question

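I'm trying to speed up a CPU binary search. Unfortunately, the GPU version is always much slower than the CPU version. Maybe the problem isn't suited to the GPU, or am I doing something wrong?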

CPU version (approx. 0.6 ms): binary search for a specific value in a sorted array of length 2000

...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{      
   int l(0), r(length-1);
   while ( l <= r ) 
   {
      m = (l+r)/2;      
      if ( search < arr[m] )
         r = m-1;
      else if ( search > arr[m] )
         l = m+1;
      else
      {         
         return index[m]; // note: index is not declared in this snippet; presumably defined elsewhere
      }         
   }
   if ( arr[m] >= search )
      return m;
   return (m+1);      
}
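
For comparison, the fall-through behaviour of Lookup (return the position of the first element that is not smaller than the search value) looks like what the standard library calls a lower bound. The sketch below is only an illustration of that reading: lookup_lower_bound is a hypothetical helper, not part of the question, and it assumes an exact hit should also report its own position, which the question's index[m] may or may not do.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not from the question): returns the position of the
// first element that is >= search, or arr.size() if every element is smaller.
// Assumes arr is sorted in ascending order.
int lookup_lower_bound(const std::vector<int>& arr, int search)
{
   return static_cast<int>(
      std::lower_bound(arr.begin(), arr.end(), search) - arr.begin());
}
```

With a sorted array {1, 3, 5}, a search for 4 yields position 2 and a search for 6 yields position 3, the same values the hand-written loop returns on a miss.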

GPU version (approx. 20 ms): binary search for a specific value in a sorted array of length 2000

....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....

__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val ) 
{
   const int num_threads = blockDim.x * gridDim.x;
   const int thread = blockIdx.x * blockDim.x + threadIdx.x;
   int set_size = array_length;

   ret_val[0] = -1; // return value
   ret_val[1] = 0;  // offset

   while(set_size != 0)
   {
      // Get the offset of the array, initially set to 0
      int offset = ret_val[1];

      // I think this is necessary in case a thread gets ahead, and resets offset before it's read
      // This isn't necessary for the unit tests to pass, but I still like it here
      __syncthreads();

      // Get the next index to check
      int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);

      // If the index is outside the bounds of the array then lets not check it
      if (index_to_check < array_length)
      {
         // If the next index is outside the bounds of the array, then set it to maximum array size
         int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
         if (next_index_to_check >= array_length)
         {
            next_index_to_check = array_length - 1;
         }

         // If we're at the mid section of the array reset the offset to this index
         if (search > arr[index_to_check] && (search < arr[next_index_to_check])) 
         {
            ret_val[1] = index_to_check;
         }
         else if (search == arr[index_to_check]) 
         {
            // Set the return var if we hit it
            ret_val[0] = index_to_check;
         }
      }

      // Since this is a p-ary search divide by our total threads to get the next set size
      set_size = set_size / num_threads;

      // Sync up so no threads jump ahead and get a bad offset
      __syncthreads();
   }
}

Even when I try larger arrays, the time ratio doesn't get any better.

Answer 1

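Your code has too many divergent branches, so you are essentially serializing the whole process on the GPU. You want to break up the work so that all threads in the same warp take the same path through each branch. See page 47 of the CUDA Best Practices Guide.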
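
One way to picture "the same path" is a binary search whose loop runs a number of times that depends only on the array length, never on the value being searched, with the comparison expressed as a select instead of an if/else chain. The following is a minimal sketch in plain C++ under that assumption; uniform_lower_bound is a hypothetical name, and whether this shape would actually fix the kernel in the question is not something this thread establishes.

```cpp
// Hypothetical helper (not from the question): lower bound over a sorted
// array where every call with the same length executes the same number of
// loop iterations, regardless of the search value.
int uniform_lower_bound(const int* arr, int length, int search)
{
   if (length == 0)
      return 0;

   int base = 0;           // start of the range that can still hold the answer
   int remaining = length; // number of candidate positions left in that range

   while (remaining > 1)
   {
      const int half = remaining / 2;
      // A select rather than a branch: both outcomes continue on the same path.
      base = (arr[base + half - 1] < search) ? base + half : base;
      remaining -= half;
   }

   // One candidate left: the answer is either base or the position after it.
   return base + ((arr[base] < search) ? 1 : 0);
}
```

For a sorted array {1, 3, 5, 7}, searching for 4 returns 2 and searching for 8 returns 4, and both calls go through exactly two loop iterations.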

Answer 2

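I must admit I'm not entirely sure what your kernel does, but am I right in assuming that you are just looking for an index that satisfies your search criteria? If so, have a look at the reduction sample that ships with CUDA for some pointers on how to structure and optimize such a query. (What you are doing is essentially trying to reduce to the index closest to the query.)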
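
To make the "search as a reduction" framing concrete, here is a rough sketch in plain C++, written sequentially so it can be compiled anywhere. Each element proposes a candidate index, and the candidates are then combined pairwise with a minimum, halving the active range on every pass. The name lower_bound_by_reduction and the exact pairing scheme are assumptions made for illustration, not taken from the CUDA sample, and this version does linear work, so it shows the structure and says nothing about speed.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not from the question or the CUDA samples): finds the
// position of the first element >= search by a min-reduction over candidates.
// Assumes arr is sorted in ascending order.
int lower_bound_by_reduction(const std::vector<int>& arr, int search)
{
   const int n = static_cast<int>(arr.size());
   if (n == 0)
      return 0;

   // Map step: one candidate per element (on a GPU, one per thread).
   // Elements that are too small propose n, meaning "no match here".
   std::vector<int> candidate(n);
   for (int i = 0; i < n; ++i)
      candidate[i] = (arr[i] >= search) ? i : n;

   // Reduce step: pairwise minimum, halving the active range on each pass.
   for (int active = n; active > 1; )
   {
      const int half = (active + 1) / 2;
      for (int i = 0; i + half < active; ++i)
         candidate[i] = std::min(candidate[i], candidate[i + half]);
      active = half;
   }
   return candidate[0];
}
```

On a GPU the inner loop is the part that would be spread across threads, with a synchronization point between passes.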

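Some quick pointers:

You are performing a lot of reads and writes to global memory, which is very slow. Try using shared memory instead.

Secondly, remember that __syncthreads() only synchronizes threads within the same block, so your reads and writes to global memory will not necessarily be synchronized across all threads (although the latency of your global memory writes may make it look as if they are).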

CUDA binary search implementation
