Suspiciously slow global memory access in CUDA

Asked: 2017-04-03 07:54:57

Tags: c++ memory cuda global

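I have a small CUDA kernel that behaves very suspiciously. Each thread handles a single point and compares it against a large data set. From that data set it computes an indicator value and saves it to the output: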

 // BLOCK_SIZE and CUDA_FACE_OFFSETTING are constants defined elsewhere (not shown in the question).
 // The kernel assumes blockDim.x == BLOCK_SIZE.
 __global__ void distanceKernel(const float4* __restrict__ data, const float4* __restrict__ points, float* __restrict__ output, unsigned int numdata, unsigned int numpoints) {
     // Thread index within the block and global index of this thread's point
     const int col = threadIdx.x;
     const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

     // Value to be computed
     float indicator = 0.0f;

     // Load the point handled by this thread
     const float4 point_p = points[idx];

     // Stream all the data from global memory through shared memory, one tile per round
     for (size_t round = 0; round < (numdata + BLOCK_SIZE - 1) / BLOCK_SIZE; ++round) {

         __shared__ float4 sData[BLOCK_SIZE * 3];

         // Each thread copies three consecutive float4 values of one data element
         const size_t base = BLOCK_SIZE * CUDA_FACE_OFFSETTING * round + col * CUDA_FACE_OFFSETTING;
         sData[col * 3]     = data[base];
         sData[col * 3 + 1] = data[base + 1];
         sData[col * 3 + 2] = data[base + 2];

         // Wait until the whole tile is loaded
         __syncthreads();
         for (int i = 0; i < BLOCK_SIZE; i++) {
             //... Lots of computations ...
         }
         // Wait until every thread is done with the tile before it is overwritten
         __syncthreads();
     }

     // Write out the result
     output[idx] = indicator;
 }

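The code works and the results are correct. My problem is that the last line seems to be extremely slow, even though the write should be coalesced. With the last line in place, the execution time is 31.5 seconds (64 blocks of 64 threads each). With it removed, the execution time is only 0.2 seconds. I measure with std::chrono::high_resolution_clock, reading the clock just before launching the kernel and again after cudaDeviceSynchronize().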
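For reference, the measurement described above can be sketched as follows. This is a minimal illustration, not the original host code: `dummyKernel` is a hypothetical stand-in for `distanceKernel`, and the warm-up launch is an added assumption so that one-time context creation is not counted in the measured interval.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real distanceKernel would be launched the same way.
__global__ void dummyKernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    const int blocks = 64, threads = 64;  // launch configuration from the question

    float* dOut = nullptr;
    if (cudaMalloc(&dOut, blocks * threads * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }

    // Warm-up launch (assumption): keeps context setup out of the timed region.
    dummyKernel<<<blocks, threads>>>(dOut);
    cudaDeviceSynchronize();

    // Kernel launches are asynchronous, so the clock is stopped only
    // after cudaDeviceSynchronize() has returned.
    const auto start = std::chrono::high_resolution_clock::now();
    dummyKernel<<<blocks, threads>>>(dOut);
    const cudaError_t err = cudaDeviceSynchronize();
    const auto stop = std::chrono::high_resolution_clock::now();

    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dOut);
        return 1;
    }

    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("kernel time: %.6f s\n", seconds);

    cudaFree(dOut);
    return 0;
}
```

Checking the status returned by cudaDeviceSynchronize() matters here: a kernel that aborts early would also produce a very short time.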

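Does anyone have an idea why this is so slow? The final memory access does not look any different from the read of point_p at the start of the kernel. Besides, numdata is large, which would mean that reading several hundred MB of data is far faster than writing back a single value per thread...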
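One hypothesis worth ruling out (an assumption, since nothing in the question confirms it): once the final store is removed, nothing the kernel computes is observable, so the compiler may eliminate the loads and the computation entirely. The 0.2 s figure would then measure a nearly empty kernel rather than a fast one. The sketch below tests this with a hypothetical workload, `work`, in place of the elided computations. `noStore` drops the result, while `guardedStore` keeps the store behind a run-time condition that is never true for this input but cannot be resolved at compile time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload standing in for "lots of computations".
__device__ float work(const float* data, unsigned int n) {
    float acc = 0.0f;
    for (unsigned int i = 0; i < n; ++i) acc += data[i] * data[i];
    return acc;
}

// The result is never stored, so the compiler is free to remove the loop.
__global__ void noStore(const float* data, unsigned int n) {
    float indicator = work(data, n);
    (void)indicator;
}

// The store depends on a kernel argument, so the computation must be kept.
__global__ void guardedStore(const float* data, unsigned int n, float* out, float sentinel) {
    const float indicator = work(data, n);
    if (indicator == sentinel) {
        out[blockIdx.x * blockDim.x + threadIdx.x] = indicator;
    }
}

static double secondsSince(std::chrono::high_resolution_clock::time_point start) {
    return std::chrono::duration<double>(
        std::chrono::high_resolution_clock::now() - start).count();
}

int main() {
    const unsigned int n = 1u << 20;      // illustrative data size
    const int blocks = 64, threads = 64;  // launch configuration from the question

    float* dData = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dOut, blocks * threads * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dData, 0, n * sizeof(float));  // all zeros, so the sum of squares is 0

    // Warm-up launch so context creation is not timed.
    noStore<<<blocks, threads>>>(dData, n);
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    noStore<<<blocks, threads>>>(dData, n);
    cudaDeviceSynchronize();
    const double tNoStore = secondsSince(start);

    start = std::chrono::high_resolution_clock::now();
    // A sentinel of -1 never equals a sum of squares, so nothing is written.
    guardedStore<<<blocks, threads>>>(dData, n, dOut, -1.0f);
    cudaDeviceSynchronize();
    const double tGuarded = secondsSince(start);

    std::printf("no store:      %.6f s\n", tNoStore);
    std::printf("guarded store: %.6f s\n", tGuarded);

    cudaFree(dData);
    cudaFree(dOut);
    return 0;
}
```

If `guardedStore` takes far longer than `noStore` although neither writes any output, the time is being spent in the computation rather than in the store. Applying the same guard to the last line of `distanceKernel` would show whether that applies to the kernel in the question.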

Thanks in advance for your help!

0 answers:

No answers yet.