Shared memory in CUDA Fortran not working as expected

Time: 2013-07-04 06:05:25

Tags: cuda gpu shared-memory gpgpu

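I am building a CUDA Fortran program and I am seeing strange behavior. I really don't understand why my code behaves this way, and I would appreciate your help.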

It seems that the value 0 is never assigned, or even that the loop executes outside its bounds.

I tried moving the if condition after the loop, but that did not help either. Thanks for your help.

    real, shared :: s_d_aaa_adk(0:15,0:15)
    real, shared :: s_d_bbb_adk(0:15,0:15)
    real, shared :: s_d_ccc_adk(0:15,0:15)

    d_k = (blockIdx%x-1)
    s_d_j = threadIdx%x-1
    s_d_l = threadIdx%y-1   

    if(d_k == kmax-1)then
        s_d_aaa_adk(s_d_j,s_d_l)  = 0 
        s_d_bbb_adk(s_d_j,s_d_l) = 0
        s_d_ccc_adk(s_d_j,s_d_l)  = 0       
    endif

    do d_k = 0, kmax-2              
        s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
        s_d_ccc_adk(s_d_j,s_d_l)  = d_ccc(d_j,s_d_l,d_k+1) 
        s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)               
    end do

All of my global memory arrays have size (16,16,kmax), the grid is (128,1,1), the block is (16,16,1), and the kernel is launched with testkernell<<<grid,block>>>().

1 Answer:

Answer 0 (score: 1)

This happens because you are conditioning the if statement on d_k, which is derived from the block index:

    d_k = (blockIdx%x-1)
    if(d_k == kmax-1)then

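This means that at most one of the 128 blocks in the grid will actually execute the body of the if statement and set those particular shared memory values to zero. All the other blocks skip the body of the if statement.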

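If kmax happens to be greater than 128, none of your blocks will execute the body of the if statement, because d_k only ranges from 0 to 127.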

If you want the if statement to take effect in every thread block, you need to condition it on something other than the block index.
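To make the block-index argument concrete, here is a minimal host-side sketch in plain Fortran (no GPU needed). It steps through the 128 block indices from the question and counts how many would enter the if branch. The program name and the three kmax values are assumptions chosen for illustration; they are not from the original post.

```fortran
! Host-side emulation only: counts how many blocks satisfy d_k == kmax-1.
! The kmax values below are hypothetical examples.
program count_blocks_taking_branch
  implicit none
  integer, parameter :: nblocks = 128                     ! grid is (128,1,1) in the question
  integer, parameter :: kmax_values(3) = [32, 128, 200]   ! assumed sample values
  integer :: i, block_idx, d_k, hits

  do i = 1, size(kmax_values)
    hits = 0
    do block_idx = 1, nblocks            ! blockIdx%x is 1-based in CUDA Fortran
      d_k = block_idx - 1                ! same mapping as in the kernel
      if (d_k == kmax_values(i) - 1) hits = hits + 1
    end do
    print '(a,i4,a,i2)', 'kmax =', kmax_values(i), ': blocks entering the if = ', hits
  end do
end program count_blocks_taking_branch
```

For kmax of 32 or 128 exactly one block matches, and for kmax of 200 none does, which is consistent with the zeros not appearing in most (or all) blocks.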

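I would offer suggestions on how to restructure the code, but it is not clear to me what you are trying to achieve in the way you load data into shared memory. For example, your do loop does not make much sense to me: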

    do d_k = 0, kmax-2
        s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
        s_d_ccc_adk(s_d_j,s_d_l)  = d_ccc(d_j,s_d_l,d_k+1)
        s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
    end do            ^     ^
                      |     |
             a given thread has specific values for these indices

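Your s_d_j and s_d_l variables are thread indices. A given thread therefore executes this do loop iteratively, loading successive values from the various global memory arrays (d_bbb, d_ccc, etc.) into exactly the same location in each shared memory array. Each iteration overwrites the previous one, so only the value from the last iteration (d_k = kmax-2) survives, and any zero stored by the if statement before the loop is overwritten as well.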

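It seems to me that you don't really understand how thread execution works. Pretend you are a given thread, assign specific values to s_d_j and s_d_l (and to d_k, although reusing that variable as the loop index overwrites the block index, which also strikes me as odd), and then see whether the execution of your code makes sense.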
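The "pretend you are one thread" exercise can be sketched on the host in plain Fortran. This is an emulation under stated assumptions, not the original kernel: kmax = 4, the thread indices (3,5), and the fill pattern (slice k holds the value k) are all made up for illustration, and the undefined d_j and d_l from the question are replaced with s_d_j and s_d_l.

```fortran
! Host-side walk-through of the question's do loop for a single thread.
! All concrete values here are hypothetical.
program one_thread_walkthrough
  implicit none
  integer, parameter :: kmax = 4           ! assumed, kept small for readability
  real :: d_bbb(0:15, 0:15, kmax)          ! stand-in for the global array
  real :: s_value                          ! stand-in for s_d_bbb_adk(s_d_j,s_d_l)
  integer :: s_d_j, s_d_l, d_k, k

  do k = 1, kmax
    d_bbb(:, :, k) = real(k)               ! slice k holds the value k everywhere
  end do

  s_d_j = 3                                ! one arbitrary thread
  s_d_l = 5

  s_value = 0.0                            ! what the if branch would have stored
  do d_k = 0, kmax - 2
    ! Same shared location on every iteration, so each load overwrites the last.
    s_value = d_bbb(s_d_j, s_d_l, d_k + 1)
    print '(a,i2,a,f4.1)', 'd_k =', d_k, '  shared value = ', s_value
  end do
end program one_thread_walkthrough
```

The shared value steps through 1.0, 2.0 and 3.0 and ends at the value of slice kmax-1. The initial zero never survives the loop, which matches the symptom described in the question.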

EDIT: In response to additional comments:

You have stated that your overall data set size (x,y,z) is (64,64,32). You have said "I am slicing ... the arrays along z. ... I want to put each slice in one block".

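This suggests that you should launch one block per slice. Alternatively, you might have an algorithm that assigns multiple blocks to one slice. Either way, I will assume that you want all of a slice's data (64,64) to be available to a given block assigned to that slice. For now I will assume that you launch 32 blocks. It should not be difficult to extend this to the case where multiple blocks work on a single slice. I am also assuming a 32x32 thread block rather than the 16x16 you indicated. Extending it to use 16x16 should not be difficult if you prefer that.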

You could do something like this:

    real, shared :: s_d_aaa_adk(0:63,0:63)
    real, shared :: s_d_bbb_adk(0:63,0:63)
    real, shared :: s_d_ccc_adk(0:63,0:63)

    ! above uses 48KB of shared mem, so assuming cc 2.0+ and cache config set accordingly

    d_k = (blockIdx%x-1)
    s_d_j = threadIdx%x-1
    s_d_l = threadIdx%y-1

    ! fill first quadrant
    s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,s_d_l,d_k+1)
    s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(s_d_j,s_d_l,d_k+1)
    s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(s_d_j,s_d_l,d_k+1)
    ! fill second quadrant
    s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = d_bbb(s_d_j+blockDim%x,s_d_l,d_k+1)
    s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = d_ccc(s_d_j+blockDim%x,s_d_l,d_k+1)
    s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = d_aaa(s_d_j+blockDim%x,s_d_l,d_k+1)
    ! fill third quadrant
    s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = d_bbb(s_d_j,s_d_l+blockDim%y,d_k+1)
    s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = d_ccc(s_d_j,s_d_l+blockDim%y,d_k+1)
    s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = d_aaa(s_d_j,s_d_l+blockDim%y,d_k+1)
    ! fill fourth quadrant
    s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_bbb(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
    s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_ccc(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
    s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_aaa(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)

    ! just guessing about what your intent was on filling with zeroes
    ! this just makes sure that one of the slices at the end gets zeroes
    ! instead of the values from the global arrays

    if(d_k == kmax-1)then
        ! fill first quadrant
        s_d_bbb_adk(s_d_j,s_d_l) = 0
        s_d_ccc_adk(s_d_j,s_d_l) = 0
        s_d_aaa_adk(s_d_j,s_d_l) = 0
        ! fill second quadrant
        s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = 0
        s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = 0
        s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = 0
        ! fill third quadrant
        s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = 0
        s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = 0
        s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = 0
        ! fill fourth quadrant
        s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
        s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
        s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
    endif
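As a possible way to drop the four hand-written quadrants and support either 16x16 or 32x32 blocks, each thread can stride across the tile in steps of blockDim. The following is only a sketch under several assumptions: the names slice_mod, load_slice, s_aaa and d_out are hypothetical, only one of the three arrays is shown, the input is filled with ones so the result is easy to check, and the write-back to d_out exists purely to make the shared memory contents observable.

```fortran
module slice_mod
  use cudafor
  implicit none
  integer, parameter :: nx = 64, ny = 64     ! slice size taken from the answer
contains
  ! Hypothetical kernel: one block per z-slice, any 2D block shape.
  attributes(global) subroutine load_slice(d_aaa, d_out, kmax)
    integer, value :: kmax
    real :: d_aaa(0:nx-1, 0:ny-1, kmax)
    real :: d_out(0:nx-1, 0:ny-1, kmax)
    real, shared :: s_aaa(0:nx-1, 0:ny-1)    ! 16 KB for one array
    integer :: d_k, j, l

    d_k = blockIdx%x - 1                     ! slice handled by this block

    ! Strided fill: each thread covers several tile elements.
    do l = threadIdx%y - 1, ny - 1, blockDim%y
      do j = threadIdx%x - 1, nx - 1, blockDim%x
        if (d_k == kmax - 1) then
          s_aaa(j, l) = 0.0                  ! guessed intent: last slice is zeroed
        else
          s_aaa(j, l) = d_aaa(j, l, d_k + 1)
        end if
      end do
    end do

    ! Barrier: needed as soon as a thread reads an element loaded by another thread.
    call syncthreads()

    ! Write the tile back so the host can inspect it.
    do l = threadIdx%y - 1, ny - 1, blockDim%y
      do j = threadIdx%x - 1, nx - 1, blockDim%x
        d_out(j, l, d_k + 1) = s_aaa(j, l)
      end do
    end do
  end subroutine load_slice
end module slice_mod

program main
  use cudafor
  use slice_mod
  implicit none
  integer, parameter :: kmax = 32            ! z extent stated in the comments
  real, allocatable :: h(:,:,:)
  real, device, allocatable :: d_aaa(:,:,:), d_out(:,:,:)
  type(dim3) :: grid, block
  integer :: istat

  allocate(h(0:nx-1, 0:ny-1, kmax))
  allocate(d_aaa(0:nx-1, 0:ny-1, kmax), d_out(0:nx-1, 0:ny-1, kmax))
  h = 1.0                                    ! assumed test data
  d_aaa = h                                  ! host-to-device copy

  grid  = dim3(kmax, 1, 1)                   ! one block per slice
  block = dim3(16, 16, 1)                    ! 32x32 should work the same way
  call load_slice<<<grid, block>>>(d_aaa, d_out, kmax)
  istat = cudaDeviceSynchronize()

  h = d_out                                  ! device-to-host copy
  print *, 'sum of slice 1    =', sum(h(:,:,1))
  print *, 'sum of last slice =', sum(h(:,:,kmax))
end program main
```

This would need a CUDA Fortran compiler (for example nvfortran with the -cuda option). With the all-ones input, the first slice should sum to 4096 and the last slice to 0. Using the full three-array version keeps the 48KB shared memory requirement noted in the answer.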