CUDA can't run more than 1024 * 255 threads

Time: 2014-05-06 22:48:53

Tags: c++ cuda thrust

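I'm running on an Amazon K520 GPU, which has about 1,500 cores and 4 GB of RAM. I'm trying to launch a kernel with 1024 * 850 threads. I know each block can hold at most 1024 threads, but I was surprised to find that I can't launch more than 255 blocks of 1024 threads each (I get a launch error). I thought the grid dimension limit was 2^16. When I run an empty kernel, it completes fine, which makes me think that something is running out of memory somewhere. Can anyone explain what is going on? Thanks. Here is the kernel: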

__global__ void dotSubCentroidNorm
(
 Pt* segments,
 int pointCount,
 const Pt* centroids,
 const int* segmentChanges,
 float *dotResult
 )
{

  int idx = index();
  if(idx>=pointCount)
    return;
  int segment = segments[idx].segmentIndex;
  if(segment<0)
    return;
  int segPtCount = segmentChanges[segment+1]-segmentChanges[segment];
  Pt &pt = segments[idx];
  if(segPtCount==0)
  {
    printf("segment pt count =0 %d %d\n",idx, segment);
    return;
  }
  const Pt &ctr = centroids[segment];
  pt.x=pt.x-ctr.x/segPtCount;
  pt.y=pt.y-ctr.y/segPtCount;
  pt.z=pt.z-ctr.z/segPtCount;

  dotResult[idx] = pt.x*pt.x;
  dotResult[pointCount + idx] = pt.x*pt.y;
  dotResult[pointCount*2 + idx] = pt.x*pt.z;
  dotResult[pointCount*3 + idx] = pt.y*pt.y;
  dotResult[pointCount*4 + idx] = pt.y*pt.z;
  dotResult[pointCount*5 + idx] = pt.z*pt.z;
}

And the struct:

struct Pt
{
  float x,y,z;
  int segmentIndex;
};

The arrays I call this kernel with hold about 400,000 Pt for segments, 200 Pt for centroids, 200 entries for segmentChanges, and 400,000 * 6 floats for dotResult. Here is the call:

....
thrust::device_vector<float> dotResult(pointCount*6);

printf("Errors1: %s \n",cudaGetErrorString(cudaGetLastError()));

int tpb = 1024; //threads per block
dim3 blocks = blkCnt(pointCount, tpb);
printf("blocks: %d %d\n", blocks.x, blocks.y);
dotSubCentroidNorm<<<blocks ,tpb>>>
  (
   segments,
   pointCount,
   thrust::raw_pointer_cast(centroids.data()),
   segmentChanges,
   thrust::raw_pointer_cast(dotResult.data())
  );
printf("Errors2: %s \n",cudaGetErrorString(cudaGetLastError()));
cudaThreadSynchronize();

printf("Errors3: %s \n",cudaGetErrorString(cudaGetLastError()));
....

 #define blkCnt(size, threadsPerBlock) dim3(min(255,(int)floor(1+(size)/(threadsPerBlock))),floor(1+(size)/(threadsPerBlock)/256))
#define index() (threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x))
....
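
As a side note, the grid shape that blkCnt produces can be worked out on the host with plain integers. The sketch below mirrors the macro's arithmetic; blkCntLike and ceilGrid are hypothetical names introduced here, and the 65535 cap on the x dimension is an assumed conservative limit rather than a value queried from the K520.

```cuda
#include <algorithm>
#include <cstdio>

// Plain-integer stand-in for dim3, so the arithmetic can be checked on the host.
struct Grid { int x, y; };

// Same arithmetic as the blkCnt macro: integer division first, then +1.
Grid blkCntLike(int size, int threadsPerBlock)
{
    int n = size / threadsPerBlock;
    Grid g = { std::min(255, 1 + n), 1 + n / 256 };
    return g;
}

// Hypothetical alternative: ceiling division, then split the blocks over x and y.
// maxX = 65535 is an assumption, not a queried device limit.
Grid ceilGrid(int size, int threadsPerBlock, int maxX = 65535)
{
    int blocks = std::max(1, (size + threadsPerBlock - 1) / threadsPerBlock);
    int x = std::min(blocks, maxX);
    int y = (blocks + x - 1) / x;
    Grid g = { x, y };
    return g;
}

int main()
{
    const int tpb = 1024;
    const int sizes[] = { 400000, 1024 * 850, 255 * 1024 + 1 };
    for (int i = 0; i < 3; ++i) {
        Grid a = blkCntLike(sizes[i], tpb);
        Grid b = ceilGrid(sizes[i], tpb);
        printf("size %d: macro %d x %d, ceil %d x %d\n",
               sizes[i], a.x, a.y, b.x, b.y);
    }
    return 0;
}
```

For 1024 * 850 threads the macro yields a 255 x 4 grid (1020 blocks for the 850 needed), so the launch itself is not capped at 255 blocks. For a size of 255 * 1024 + 1, however, it yields 255 x 1, which is one block short; the ceiling-division variant avoids that edge case.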

1 Answer:

Answer 0 (score: 1)

Apparently I was passing a host array for "segmentChanges" instead of a device array, and that was the cause of the crash.
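
A minimal sketch of that kind of fix, assuming segmentChanges starts out as a host-side std::vector<int>: copy it into a thrust::device_vector and pass the raw device pointer to the kernel. The segmentSizes kernel and the segmentSizesOnDevice helper are hypothetical stand-ins, not the code from the question; they only read the array the same way the original kernel does.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Toy kernel (not the question's): reads segmentChanges through a raw pointer,
// which must point to device memory.
__global__ void segmentSizes(const int* segmentChanges, int segmentCount, int* sizes)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= segmentCount)
        return;
    sizes[s] = segmentChanges[s + 1] - segmentChanges[s];
}

// Copies the host array to the device before the launch; returns an empty
// vector if the launch or the synchronization reports an error.
std::vector<int> segmentSizesOnDevice(const std::vector<int>& hostChanges)
{
    int segmentCount = (int)hostChanges.size() - 1;
    if (segmentCount <= 0)
        return std::vector<int>();

    // Constructing from host iterators performs the host-to-device copy.
    thrust::device_vector<int> devChanges(hostChanges.begin(), hostChanges.end());
    thrust::device_vector<int> devSizes(segmentCount);

    int tpb = 256;
    int blocks = (segmentCount + tpb - 1) / tpb;
    segmentSizes<<<blocks, tpb>>>(thrust::raw_pointer_cast(devChanges.data()),
                                  segmentCount,
                                  thrust::raw_pointer_cast(devSizes.data()));

    cudaError_t launchErr = cudaGetLastError();      // invalid launch configuration
    cudaError_t syncErr = cudaDeviceSynchronize();   // faults raised while running
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        printf("launch: %s, sync: %s\n",
               cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
        return std::vector<int>();
    }

    std::vector<int> result(segmentCount);
    thrust::copy(devSizes.begin(), devSizes.end(), result.begin());
    return result;
}

int main()
{
    const int raw[] = { 0, 3, 3, 10 };
    std::vector<int> hostChanges(raw, raw + 4);
    std::vector<int> sizes = segmentSizesOnDevice(hostChanges);
    for (size_t s = 0; s < sizes.size(); ++s)
        printf("segment %d has %d points\n", (int)s, sizes[s]);
    return sizes.size() == 3 ? 0 : 1;
}
```

Checking the result of the synchronization call separately from cudaGetLastError helps here, because a bad pointer dereferenced inside the kernel typically only surfaces once the kernel has actually run.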