My kernel has a sequential part that really slows it down. However, I don't see how to get rid of the inner loop. Any suggestions?
__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
int* d_Xco, bool* d_Xvalid,int* d_A )
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k1;
if( i < keep && j <= i){
int counter = 0;
for(k1 = 0; k1 < inc; k1++){
if(d_Xvalid[j*inc + k1] == 0)
counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
}
d_A[i*keep+j] = inc - d_Xnum[i] - counter;
}
}
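I believe that getting rid of the k1 loop would speed up my code. However, I don't see how to do that with counter. Any suggestions, ideas or thoughts are welcome!
This kernel is called by: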
...
int t = 32;
int b = keep/(32)+1;
int b2 = (inc/32)+1;
dim3 thread (t, t);
dim3 block (b, inc);
// kernel call
myKernel<<<block, thread>>>(k, inc, width, d_Xnum,
d_Xco, d_Xvalid, d_A);
cudaThreadSynchronize();
...
keep is around 9000 and inc is around 20000.
Answer 0 (score: 2)
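This is not an exact answer to your question, but it may optimize your code, and it may help you implement a parallel reduction of the k1 sum, since it removes the j <= i restriction from if( i < keep && j <= i). Depending on your GPU model you can also apply other optimizations, such as using textures to access the read-only vectors.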
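Because of the way you generate the indices, many threads sit idle while waiting for the others to finish. You are launching keep*inc threads, but at most keep*(keep+1)/2 of them do any work (because of the condition j <= i).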
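To put numbers on this, here is a small host-side C++ sketch. The helper names launchedThreads and usefulThreads are made up for illustration; they only count threads and do not touch the GPU.

```cpp
#include <cassert>
#include <cstdint>

// Threads launched by a grid of gridX x gridY blocks,
// each holding blockX x blockY threads.
std::int64_t launchedThreads(std::int64_t gridX, std::int64_t gridY,
                             std::int64_t blockX, std::int64_t blockY) {
    assert(gridX > 0 && gridY > 0 && blockX > 0 && blockY > 0);
    return gridX * gridY * blockX * blockY;
}

// Threads that pass the guard `i < keep && j <= i`:
// one per entry of the lower triangle, diagonal included.
std::int64_t usefulThreads(std::int64_t keep) {
    assert(keep >= 0);
    return keep * (keep + 1) / 2;
}
```

For keep = 9000, usefulThreads gives 40,504,500, which is about 22.5% of keep*inc = 180,000,000 when inc = 20000. If the launch is exactly as posted in the question (dim3 block(b, inc) with 32x32 threads per block, so b = 282), launchedThreads(282, 20000, 32, 32) gives 5,775,360,000, so the share of useful threads would be smaller still.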
I think you can improve it with the following changes:
Launch keep*(keep+1)/2 threads.
Change your code as follows:
__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
int* d_Xco, bool* d_Xvalid,int* d_A )
{
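int k = blockIdx.x * blockDim.x + threadIdx.x; // linear index into the lower triangle
int i = (int)(sqrt(0.25+2.0*k)-0.5); // row: largest i with i*(i+1)/2 <= k
int j = k - i*(i+1)/2; // column within row i, so 0 <= j <= i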
int k1;
if( i < keep && j < inc){
int counter = 0;
for(k1 = 0; k1 < inc; k1++){
if(d_Xvalid[j*inc + k1] == 0)
counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
}
d_A[i*keep+j] = inc - d_Xnum[i] - counter;
}
}
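The grid size for this one-dimensional launch could be computed on the host along these lines. This is only a sketch: blocksForTriangle is a hypothetical helper, and the block size passed to it is your choice.

```cpp
#include <cassert>
#include <cstdint>

// Blocks needed so that blocks * threadsPerBlock covers all
// keep*(keep+1)/2 lower-triangle entries (ceiling division).
std::int64_t blocksForTriangle(std::int64_t keep, std::int64_t threadsPerBlock) {
    assert(keep >= 0 && threadsPerBlock > 0);
    const std::int64_t total = keep * (keep + 1) / 2;
    return (total + threadsPerBlock - 1) / threadsPerBlock;
}
```

With keep = 9000 and an assumed 256 threads per block this gives 158,221 blocks, which would be launched as myKernel<<<blocks, 256>>>(...). The last block contains a few threads with k >= keep*(keep+1)/2; for those the formula yields i >= keep, so the i < keep test in the kernel rejects them. Note that a block count this large exceeds the 65,535 limit on the x dimension of the grid on devices older than compute capability 3.0.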
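What you are doing now (for keep = 4 you launch 4*4 = 16 threads in the best case; if inc > keep, as seems to be the case here, you launch even more) can be pictured like this, where each box is a thread: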
_________________________________
| i = 0 | i = 0 | i = 0 | i = 0 |
| j = 0 | - | - | - |
_________________________________
| i = 1 | i = 1 | i = 1 | i = 1 |
| j = 0 | j = 1 | - | - |
_________________________________
| i = 2 | i = 2 | i = 2 | i = 2 |
| j = 0 | j = 1 | j = 2 | - |
_________________________________
| i = 3 | i = 3 | i = 3 | i = 3 |
| j = 0 | j = 1 | j = 2 | j = 3 |
_________________________________
I suggest adding a single index k and deriving i and j from it as needed (for keep = 4 you launch 4*(4+1)/2 = 10 threads):
_________________________________________________________________________________
| k = 0 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 | k = 8 | k = 9 |
| i = 0 | i = 1 | i = 1 | i = 2 | i = 2 | i = 2 | i = 3 | i = 3 | i = 3 | i = 3 |
| j = 0 | j = 0 | j = 1 | j = 0 | j = 1 | j = 2 | j = 0 | j = 1 | j = 2 | j = 3 |
_________________________________________________________________________________
This can be done with
 i = (int)(sqrt(0.25+2*k)-0.5)
 j = k - i*(i+1)/2
You can take this as a recipe, or look at the math behind it.
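To get i, note that when j = 0 you have i*(i+1)/2 = k (because k = 0+1+2+...+i = i*(i+1)/2). If you solve this equation for i, you get i = sqrt(0.25+2*k)-0.5; the cast to int rounds down, which ensures that you also get the correct result when j != 0.
To get j, subtract from k the value it would have if j were 0, that is i*(i+1)/2.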
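If you would rather check the recipe than take it on trust, a host-side C++ test along these lines can do it. The names Pair, unrank and mappingMatchesTriangle are invented for this sketch; unrank uses the same formula as the kernel, evaluated in double precision.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Pair { std::int64_t i, j; };

// Same formula as in the kernel: linear index k -> (i, j) with 0 <= j <= i.
Pair unrank(std::int64_t k) {
    assert(k >= 0);
    const std::int64_t i =
        static_cast<std::int64_t>(std::sqrt(0.25 + 2.0 * k) - 0.5);
    const std::int64_t j = k - i * (i + 1) / 2;
    return {i, j};
}

// Walk the lower triangle row by row and check that the k-th entry
// visited is exactly the pair that unrank(k) returns.
bool mappingMatchesTriangle(std::int64_t keep) {
    std::int64_t k = 0;
    for (std::int64_t i = 0; i < keep; ++i) {
        for (std::int64_t j = 0; j <= i; ++j, ++k) {
            const Pair p = unrank(k);
            if (p.i != i || p.j != j) return false;
        }
    }
    return true;
}
```

Calling mappingMatchesTriangle(9000) checks every index used for keep around 9000. The double-precision square root matters here: for that size 2*k goes past 2^24, the largest range in which a float represents every integer exactly, so a single-precision variant such as sqrtf could pick the wrong row near row boundaries.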