My test function looks like this.
#define DIMENSION 20
#define POPSIZE 5000
__global__ void repairT(int* H, int* diff){
    int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    int ii = tidx * DIMENSION;
    //if (ii < DIMENSION * POPSIZE)
    //{
    int Hdiff[DIMENSION] = { 0 };
    int diffcount = 0;
    bool isInIndiv = false;
    // complement set of H
    for (int i = 1; i <= DIMENSION; i++)
    {
        for (int j = ii; j < ii + DIMENSION; j++) // loop over H
        {
            if (i == H[j])
            {
                isInIndiv = isInIndiv || true;
            }
        }
        if (isInIndiv == false)
        {
            Hdiff[diffcount] = i;
            diffcount++;
        }
        else
            isInIndiv = false;
    }
    // copy diff to the output array
    int diffc = ii * DIMENSION;
    for (int i = 0; i < DIMENSION; i++)
    {
        diff[diffc] = Hdiff[i];
        diffc++;
    }
    //}
}
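I have a large 1D array called H (POPSIZE * DIMENSION elements). I want to create a new array, diff, that stores the elements missing from each interval of H: 0-19, 20-39, and so on.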
I need to run this code efficiently in parallel 5000 times. I tried the following, but it only processes the 0-19 interval of H:
dim3 nbThreadsR1(128);
dim3 nbBlocksR1((POPSIZE / nbThreadsR1.x) + 1);
repairT<<<nbBlocksR1, nbThreadsR1>>>(d_H, d_diff);
Please give me some advice.
Answer 0 (score: 1)
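Your accesses to H and diff are not coalesced: each thread walks its own 20-element slice, so neighboring threads touch addresses 20 ints apart and the memory unit is used inefficiently. You will want to either reorder the data or change the code so that the accesses are coalesced.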
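As an illustration of the "reorder the data" option, here is a minimal sketch that assumes you are free to store H and diff transposed, with element k of individual t at index k * POPSIZE + t. That layout, the kernel name repairCoalesced, and the test data in main are all assumptions made for this example and are not part of the original code. The sketch also swaps the per-thread Hdiff array for a 32-bit mask of seen values, which works here because DIMENSION is 20.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIMENSION 20
#define POPSIZE 5000

// Assumed layout (not from the question): element k of individual t
// lives at index k * POPSIZE + t, for both H and diff.
__global__ void repairCoalesced(const int* H, int* diff)
{
    int t = blockDim.x * blockIdx.x + threadIdx.x;
    if (t >= POPSIZE) return;  // the grid is rounded up, so guard the tail

    // Record which values in 1..DIMENSION occur in this individual.
    unsigned int seen = 0;
    for (int k = 0; k < DIMENSION; k++)
    {
        int v = H[k * POPSIZE + t];  // threads t, t+1, ... read adjacent addresses
        if (v >= 1 && v <= DIMENSION) seen |= 1u << (v - 1);
    }

    // Write the missing values in ascending order. Threads that share the
    // same count write adjacent addresses within row `count`.
    int count = 0;
    for (int i = 1; i <= DIMENSION; i++)
    {
        if (!(seen & (1u << (i - 1))))
        {
            diff[count * POPSIZE + t] = i;
            count++;
        }
    }
    // Pad the rest of this individual's column with zeros.
    for (; count < DIMENSION; count++)
        diff[count * POPSIZE + t] = 0;
}

int main()
{
    const int n = POPSIZE * DIMENSION;
    int* h_H = new int[n];
    int* h_diff = new int[n];

    // Test data: every individual holds 1..10 twice, so 11..20 are missing.
    for (int t = 0; t < POPSIZE; t++)
        for (int k = 0; k < DIMENSION; k++)
            h_H[k * POPSIZE + t] = (k % 10) + 1;

    int *d_H, *d_diff;
    cudaMalloc(&d_H, n * sizeof(int));
    cudaMalloc(&d_diff, n * sizeof(int));
    cudaMemcpy(d_H, h_H, n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 threads(128);
    dim3 blocks((POPSIZE + threads.x - 1) / threads.x);
    repairCoalesced<<<blocks, threads>>>(d_H, d_diff);
    cudaMemcpy(h_diff, d_diff, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the missing values found for the last individual.
    for (int k = 0; k < DIMENSION; k++)
        printf("%d ", h_diff[k * POPSIZE + (POPSIZE - 1)]);
    printf("\n");

    cudaFree(d_H);
    cudaFree(d_diff);
    delete[] h_H;
    delete[] h_diff;
    return 0;
}
```

Whether the transposed layout pays off depends on how the rest of the program reads H; if other kernels need whole individuals contiguously, the cost of transposing may outweigh the gain.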
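Also, you seem to be reading each H[j] many times: once for every candidate value i. You may want to define another small array, Hcache, and preload it to avoid the redundant global-memory reads: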
int Hcache[DIMENSION];
for (int j = 0; j < DIMENSION; j++) // load this thread's slice of H once
{
    Hcache[j] = H[j + ii];
}
for (int i = 1; i <= DIMENSION; i++)
{
    for (int j = 0; j < DIMENSION; j++) // search the cached slice
    {
        if (i == Hcache[j])
        {
            isInIndiv = isInIndiv || true;
        }
    }
    if (isInIndiv == false)
    {
        Hdiff[diffcount] = i;
        diffcount++;
    }
    else
        isInIndiv = false;
}
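For context, here is a minimal sketch of how the cached loop might sit inside a complete kernel, keeping the original row-major layout. Two details are assumptions of this sketch rather than something established above: the bounds check is enabled because the launch rounds the grid up past POPSIZE, and the output offset is ii instead of ii * DIMENSION, on the assumption that diff is meant to have the same size and layout as H. The kernel name repairCached and the test data are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIMENSION 20
#define POPSIZE 5000

__global__ void repairCached(const int* H, int* diff)
{
    int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    if (tidx >= POPSIZE) return;  // guard the threads past the last individual
    int ii = tidx * DIMENSION;    // start of this thread's 20-element slice

    // Read the slice from global memory once.
    int Hcache[DIMENSION];
    for (int j = 0; j < DIMENSION; j++)
        Hcache[j] = H[ii + j];

    // Collect the values in 1..DIMENSION that the slice does not contain.
    int Hdiff[DIMENSION] = { 0 };
    int diffcount = 0;
    for (int i = 1; i <= DIMENSION; i++)
    {
        bool isInIndiv = false;
        for (int j = 0; j < DIMENSION; j++)
            if (i == Hcache[j]) isInIndiv = true;
        if (!isInIndiv) Hdiff[diffcount++] = i;
    }

    // Assumption: diff has the same size and layout as H, so the offset is ii.
    for (int i = 0; i < DIMENSION; i++)
        diff[ii + i] = Hdiff[i];
}

int main()
{
    const int n = POPSIZE * DIMENSION;
    int* h_H = new int[n];
    int* h_diff = new int[n];

    // Test data: even individuals hold 1..10 twice, odd ones hold 11..20 twice.
    for (int t = 0; t < POPSIZE; t++)
        for (int k = 0; k < DIMENSION; k++)
            h_H[t * DIMENSION + k] = (k % 10) + 1 + (t % 2) * 10;

    int *d_H, *d_diff;
    cudaMalloc(&d_H, n * sizeof(int));
    cudaMalloc(&d_diff, n * sizeof(int));
    cudaMemcpy(d_H, h_H, n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 threads(128);
    dim3 blocks((POPSIZE + threads.x - 1) / threads.x);
    repairCached<<<blocks, threads>>>(d_H, d_diff);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    cudaMemcpy(h_diff, d_diff, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the diff rows of the first and the last individual.
    const int rows[2] = { 0, POPSIZE - 1 };
    for (int r = 0; r < 2; r++)
    {
        for (int k = 0; k < DIMENSION; k++)
            printf("%d ", h_diff[rows[r] * DIMENSION + k]);
        printf("\n");
    }

    cudaFree(d_H);
    cudaFree(d_diff);
    delete[] h_H;
    delete[] h_diff;
    return 0;
}
```

Checking the result of cudaDeviceSynchronize after the launch is worth keeping while debugging, since an out-of-bounds write inside a kernel otherwise tends to go unnoticed on the host side.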
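Finally, make sure the compiler has enough freedom in its register budget, and that your device can supply that many registers, so that Hcache and Hdiff are kept in the register file (see the nvcc maxrregcount option).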
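One way to see what the compiler actually did is to query the kernel's resource usage at run time. The sketch below uses cudaFuncGetAttributes, whose cudaFuncAttributes result includes numRegs and localSizeBytes; a nonzero localSizeBytes suggests that some per-thread data was placed in local memory instead of registers. The kernel sumSlice is a hypothetical stand-in, not the kernel from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIMENSION 20

// Hypothetical stand-in kernel with a small per-thread array.
__global__ void sumSlice(const int* H, int* out, int n)
{
    int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    if (tidx >= n) return;

    // Fully unrolled loops give the compiler compile-time indices,
    // which makes it easier to keep the array in registers.
    int Hcache[DIMENSION];
    #pragma unroll
    for (int j = 0; j < DIMENSION; j++)
        Hcache[j] = H[tidx * DIMENSION + j];

    int sum = 0;
    #pragma unroll
    for (int j = 0; j < DIMENSION; j++)
        sum += Hcache[j];
    out[tidx] = sum;
}

int main()
{
    // Ask the runtime how many registers and how much local memory
    // each thread of the compiled kernel uses.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, sumSlice);
    if (err != cudaSuccess)
    {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

The register cap itself is set at compile time, for example with nvcc --maxrregcount=N, and nvcc -Xptxas -v prints the same per-kernel figures during compilation. Note that an array indexed by a run-time value, such as Hdiff[diffcount] in the question, may still end up in local memory regardless of the cap.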