I'm running on an Amazon K520 GPU with 1500 cores and 4 GB of RAM. I'm trying to run a kernel over 1024 * 850 threads. I know each block can hold at most 1024 threads, but I was surprised that I couldn't launch more than 255 blocks of 1024 threads each (I get a launch error). I thought the grid dimension limit was 2^16. When I run an empty kernel instead, it completes fine, which makes me think I'm running out of memory somewhere. I'd appreciate an explanation of what's going on. Thanks. Here is the kernel:
__global__ void dotSubCentroidNorm
(
    Pt* segments,
    int pointCount,
    const Pt* centroids,
    const int* segmentChanges,
    float* dotResult
)
{
    int idx = index();
    if (idx >= pointCount)
        return;
    int segment = segments[idx].segmentIndex;
    if (segment < 0)
        return;
    int segPtCount = segmentChanges[segment + 1] - segmentChanges[segment];
    Pt& pt = segments[idx];
    if (segPtCount == 0)
    {
        printf("segment pt count =0 %d %d\n", idx, segment);
        return;
    }
    const Pt& ctr = centroids[segment];
    pt.x = pt.x - ctr.x / segPtCount;
    pt.y = pt.y - ctr.y / segPtCount;
    pt.z = pt.z - ctr.z / segPtCount;
    dotResult[idx] = pt.x * pt.x;
    dotResult[pointCount + idx] = pt.x * pt.y;
    dotResult[pointCount * 2 + idx] = pt.x * pt.z;
    dotResult[pointCount * 3 + idx] = pt.y * pt.y;
    dotResult[pointCount * 4 + idx] = pt.y * pt.z;
    dotResult[pointCount * 5 + idx] = pt.z * pt.z;
}
And the struct:
struct Pt
{
    float x, y, z;
    int segmentIndex;
};
I call this kernel with arrays of roughly 400,000 Pt for segments, 200 Pt for centroids, 200 ints for segmentChanges, and 400,000 * 6 floats for dotResult. That works out to about 400,000 * 16 B ≈ 6.4 MB for segments and 400,000 * 6 * 4 B ≈ 9.6 MB for dotResult, nowhere near 4 GB, so I don't see how I could be out of memory. Here is the call:
....
thrust::device_vector<float> dotResult(pointCount*6);
printf("Errors1: %s \n",cudaGetErrorString(cudaGetLastError()));
int tpb = 1024; //threads per block
dim3 blocks = blkCnt(pointCount, tpb);
printf("blocks: %d %d\n", blocks.x, blocks.y);
dotSubCentroidNorm<<<blocks, tpb>>>
(
segments,
pointCount,
thrust::raw_pointer_cast(centroids.data()),
segmentChanges,
thrust::raw_pointer_cast(dotResult.data())
);
printf("Errors2: %s \n",cudaGetErrorString(cudaGetLastError()));
cudaThreadSynchronize();
printf("Errors3: %s \n",cudaGetErrorString(cudaGetLastError()));
....
#define blkCnt(size, threadsPerBlock) dim3(min(255,(int)floor(1+(size)/(threadsPerBlock))),floor(1+(size)/(threadsPerBlock)/256))
#define index() (threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x))
....
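For reference, the real limits can be queried at runtime rather than guessed; here is a minimal sketch (assuming the K520 shows up as device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: %d x %d x %d\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("total global mem: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}

A compute capability 3.0 device like the K520 should report a maximum grid of 2147483647 x 65535 x 65535, so a 255 x 2 grid is nowhere near the hardware limit.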
Answer 0 (score: 1):
It turns out I was passing a host array for "segmentChanges" instead of a device array, and that is what caused the crash: dereferencing a host pointer inside device code fails at runtime, while the empty kernel never touched it and so ran fine.
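A minimal sketch of the fix, assuming the offsets start out in a host std::vector<int> (the name h_segmentChanges is hypothetical; the question doesn't show the host side):

#include <vector>
#include <thrust/device_vector.h>

// h_segmentChanges holds the host-side segment offsets (about 200 ints here);
// copying them into a thrust::device_vector yields memory that is valid
// to dereference on the device.
thrust::device_vector<int> d_segmentChanges(h_segmentChanges.begin(),
                                            h_segmentChanges.end());

dotSubCentroidNorm<<<blocks, tpb>>>
(
    segments,
    pointCount,
    thrust::raw_pointer_cast(centroids.data()),
    thrust::raw_pointer_cast(d_segmentChanges.data()), // device pointer now
    thrust::raw_pointer_cast(dotResult.data())
);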