Strange CUDA behavior with a large number of threads

Date: 2014-07-15 16:10:23

Tags: c++ cuda

I want to prepare my CUDA kernel to handle a huge number of particles (far more than 65535, which is the maximum gridDim value). I tried to create a proper thread index mapping that works for any <<<numBlocks, threadsPerBlock>>> values.

I wrote this:

__global__ void step_k(float* position, size_t numElements, unsigned int* blabla) 
{   
    unsigned int i = calculateIndex();

    if (i < numElements){
        blabla[i] = i;
    }
}

__device__ unsigned int calculateIndex(){
    unsigned int xIndex = blockIdx.x*blockDim.x+threadIdx.x;
    unsigned int yIndex = blockIdx.y*blockDim.y+threadIdx.y;
    unsigned int zIndex = blockIdx.z*blockDim.z+threadIdx.z;

    unsigned int xSize = gridDim.x*blockDim.x;
    unsigned int ySize = gridDim.y*blockDim.y;

    return xSize*ySize*zIndex+xSize*yIndex+xIndex;
}

I use it this way:

void CudaSphFluids::step(void)
{
    //dim3 threadsPerBlock(1024, 1024, 64);
    //dim3 numBlocks(65535, 65535, 65535);

    dim3 numBlocks(1, 1, 1);
    dim3 threadsPerBlock(256, 256, 1);

    unsigned int result[256] = {};
    unsigned int* d_results;
    cudaMalloc( (void**) &d_results,sizeof(unsigned int)*256);

    step_k<<<numBlocks, threadsPerBlock>>>(d_position, 256, d_results);

    cudaMemcpy(result,d_results,sizeof(unsigned int)*256,cudaMemcpyDeviceToHost);

    CLOG(INFO, "SPH")<<"STEP";
    for(unsigned int t=0; t<256;t++) {
        cout<<result[t]<<"; ";
    }
    cout<<endl;

    cudaFree(d_results);
    Sleep(200);
}

It seems to be fine (the numbers increase from 0 to 255) with:

dim3 numBlocks(1, 1, 1);
dim3 threadsPerBlock(256, 1, 1);

It also works for:

dim3 numBlocks(1, 1, 1);
dim3 threadsPerBlock(256, 3, 1);

But when I try to run it with:

dim3 numBlocks(1, 1, 1);
dim3 threadsPerBlock(256, 5, 1);

(screenshot of the incorrect output)

or with:

dim3 numBlocks(1, 1, 1);
dim3 threadsPerBlock(256, 10, 1);

(screenshot of the incorrect output)

and with even larger values such as:

dim3 numBlocks(1, 1, 1);
dim3 threadsPerBlock(256, 256, 1);

it goes completely crazy:

(screenshot of the incorrect output)

Then I tried another mapping from some smart guy's website:

__device__ int getGlobalIdx_3D_3D()
{
    int blockId = blockIdx.x
                + blockIdx.y * gridDim.x
                + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                 + (threadIdx.z * (blockDim.x * blockDim.y))
                 + (threadIdx.y * blockDim.x)
                 + threadIdx.x;
    return threadId;
}

But unfortunately it doesn't work either. (The numbers are different, but still wrong.)

Any idea what is causing this strange behavior?

I'm using CUDA 6.0 on a GeForce GTX 560 Ti (sm_21), with VS2012 and Nsight.

1 Answer:

Answer 0 (score: 1):

This requests 65536 threads per block:

dim3 threadsPerBlock(256, 256, 1);

That is not acceptable on any current CUDA GPU, which are limited to either 512 or 1024 threads per block.
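For reference, the exact limits of a given device can be queried at runtime with cudaGetDeviceProperties; here is a minimal standalone sketch (not part of the code above):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Total thread limit per block, per-dimension block limits, and grid limits.
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

On a cc 2.x device such as the GTX 560 Ti this should report 1024 for maxThreadsPerBlock.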

These also launch too many threads per block:

dim3 threadsPerBlock(256, 5, 1);
dim3 threadsPerBlock(256, 10, 1);

First of all, add proper cuda error checking to your program. I would suggest doing this on any CUDA code before posting it. You will get more information, and others will be able to help you better.
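For example, a minimal error-checking sketch (the macro name and formatting are just one common pattern, not anything specific to your code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call; abort with file/line info on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After a kernel launch, check both the launch and the execution:
//     step_k<<<numBlocks, threadsPerBlock>>>(d_position, 256, d_results);
//     CUDA_CHECK(cudaGetLastError());      // catches invalid launch configurations
//     CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during kernel execution

With a configuration like (256, 256, 1) the kernel never actually runs, and cudaGetLastError() right after the launch would report an invalid configuration argument, which points directly at the problem.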

Although you haven't shown your complete kernel, your kernel indexing appears to be set up correctly for 3D indexing. So it may just be a matter of also modifying this line:

dim3 numBlocks(1, 1, 1);

if you expect to get reasonable performance out of the GPU.
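For example, here is a rough sketch of how the launch configuration could be sized to cover all particles with the indexing scheme you already have (the numbers are illustrative, and it assumes d_position and d_results are device buffers allocated for numElements entries):

// Sketch only: cover numElements particles with 256-thread blocks while
// staying within the 65535-blocks-per-grid-dimension limit of sm_21.
size_t numElements = 10000000;               // illustrative: ten million particles
dim3 threadsPerBlock(256, 1, 1);

unsigned int blocksNeeded =
    (unsigned int)((numElements + threadsPerBlock.x - 1) / threadsPerBlock.x);

// Spread the blocks over x and y so that neither grid dimension exceeds 65535.
unsigned int gridX = blocksNeeded > 65535u ? 65535u : blocksNeeded;
unsigned int gridY = (blocksNeeded + gridX - 1) / gridX;
dim3 numBlocks(gridX, gridY, 1);

step_k<<<numBlocks, threadsPerBlock>>>(d_position, numElements, d_results);

The surplus threads in the padded grid simply fail the if (i < numElements) test in the kernel and do nothing.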