Question

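I have a small CUDA kernel that behaves very suspiciously. Each thread processes a single point and compares it against a large dataset. From that dataset it computes an indicator and writes it to the output: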

 __global__ void distanceKernel(float4* __restrict__ data, float4* __restrict__ points, float* __restrict__ output, unsigned int numdata, unsigned int numpoints) {
 //Read the threadID
 const int col = threadIdx.x;

 //Value to be computed
 float indicator=0.0f;

 //Loads the specific point
 float4 const point_p = points[blockIdx.x * blockDim.x + threadIdx.x];

 //Stream all the data from global memory to shared memory
 for (size_t round = 0; round < (numdata+ BLOCK_SIZE - 1) / (BLOCK_SIZE); ++round) {

     __shared__ float4 sData[BLOCK_SIZE * 3];

     sData[col * 3] = data[BLOCK_SIZE*CUDA_FACE_OFFSETTING*round + (col)*CUDA_FACE_OFFSETTING];
     sData[col * 3 + 1] = data[BLOCK_SIZE*CUDA_FACE_OFFSETTING*round + (col)*CUDA_FACE_OFFSETTING + 1];
     sData[col * 3 + 2] = data[BLOCK_SIZE*CUDA_FACE_OFFSETTING*round + (col)*CUDA_FACE_OFFSETTING + 2];

     __syncthreads();
     for (int i = 0; i < BLOCK_SIZE; i++) {
         //... Lots of computations ...
     }
     __syncthreads();
 }
 //Write out the data
 output[blockIdx.x * blockDim.x + threadIdx.x] = indicator;
 }

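The code works and the results are correct. My problem is that the last line appears to be extremely slow, even though the write should be coalesced. With the last line in place, the execution time is 31.5 seconds (64 blocks of 64 threads). With it removed, the execution time is only 0.2 seconds. I measured with std::chrono::high_resolution_clock, calling it before the kernel launch and after cudaDeviceSynchronize().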
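For reference, the measurement described above can be sketched roughly as follows. This is only a minimal illustration, not the original harness: `standInKernel` and `timeLaunchSeconds` are hypothetical names, the stand-in kernel replaces `distanceKernel` so the example is self-contained, and the iteration count is an arbitrary assumption.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for distanceKernel: does some per-thread work and
// writes one value per thread, like the last line of the original kernel.
__global__ void standInKernel(float* __restrict__ output, unsigned int iterations) {
    float indicator = 0.0f;
    for (unsigned int i = 0; i < iterations; ++i) {
        indicator += sqrtf(static_cast<float>(i));
    }
    output[blockIdx.x * blockDim.x + threadIdx.x] = indicator;
}

// Returns the wall-clock time in seconds for one launch,
// or -1.0 if any CUDA call fails (for example, when no device is present).
double timeLaunchSeconds(unsigned int numBlocks, unsigned int threadsPerBlock) {
    float* dOutput = nullptr;
    const size_t bytes = sizeof(float) * numBlocks * threadsPerBlock;
    if (cudaMalloc(&dOutput, bytes) != cudaSuccess) {
        return -1.0;
    }

    // Start the clock right before the launch.
    const auto start = std::chrono::high_resolution_clock::now();
    standInKernel<<<numBlocks, threadsPerBlock>>>(dOutput, 1000u);  // assumed iteration count
    const cudaError_t launchStatus = cudaGetLastError();
    // Kernel launches are asynchronous, so wait for completion before stopping the clock.
    const cudaError_t syncStatus = cudaDeviceSynchronize();
    const auto stop = std::chrono::high_resolution_clock::now();

    cudaFree(dOutput);
    if (launchStatus != cudaSuccess || syncStatus != cudaSuccess) {
        return -1.0;
    }
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `timeLaunchSeconds(64, 64)` from `main` would correspond to the 64 blocks of 64 threads mentioned above.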

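Does anyone have an idea why this is so slow? The final memory access does not seem any different from reading the variable point_p at the start of the code. Moreover, numdata is large, which would mean that reading hundreds of MB of data is much faster than writing back a single value...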

Thanks in advance for your help!

Suspiciously slow global memory access in CUDA

0 answers: