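I am trying to use the !$acc cache directive for a specific loop in a Laplace 2D solver. When I analyze the code with -Mcuda=ptxinfo, it shows that no shared memory (smem) is used, yet the code runs slower than the baseline?!
Here is part of the code: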
!$acc parallel loop reduction(max:error) num_gangs(n/THREADS) vector_length(THREADS)
do j=2,m-1
do i=2,n-1
#ifdef SHARED
!$acc cache(A(i-1:i+1,j),A(i,j-1:j+1))
#endif
Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
error = max( error, abs( Anew(i,j) - A(i,j) ) )
end do
end do
!$acc end parallel
This is the output when using !$acc cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
This is the output without cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
In addition, -Minfo=accel reports that some memory has been cached:
acc_lap2d:
17, Generating copy(a(:4096,:4096))
Generating create(anew(:4096,:4096))
39, Accelerator kernel generated
Generating Tesla code
39, Max reduction generated for error
40, !$acc loop gang(256) ! blockidx%x
41, !$acc loop vector(16) ! threadidx%x
Cached references to size [(x)x3] block of a
Loop is parallelizable
58, Accelerator kernel generated
Generating Tesla code
59, !$acc loop gang ! blockidx%x
60, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
I would like to know how to use the cache directive (shared memory in the CUDA sense) effectively in OpenACC.
Thank you very much for your help.
Behzad
Answer 0 (score: 3)
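The compiler should be flagging this as an error. You can't list the same variable twice in the same cache directive. Since I work for PGI, I have added a technical problem report (TPR#21898) requesting that we detect this error. Although it is not specifically illegal in the current OpenACC specification, we will bring it up with the standards committee. The problem is that the compiler cannot tell which of the two cached arrays to use for either reference.
The fix is to combine the two references: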
!$acc cache(A(i-1:i+1,j-1:j+1))
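Put back into the loop from the question, the combined directive would look roughly like the sketch below. This is not the original program: the program name, the small array size (64 instead of 4096), the boundary values, the data clauses, and the omission of num_gangs/vector_length are all assumptions made for illustration. A compiler without OpenACC enabled treats the !$acc lines as comments, so the sketch also runs serially.

```fortran
program lap2d_cache_sketch
  implicit none
  ! Assumed small problem size for illustration; the question uses 4096 x 4096.
  integer, parameter :: n = 64, m = 64
  real(8) :: A(n,m), Anew(n,m)
  real(8) :: error
  integer :: i, j

  ! Assumed boundary values: first column held at 1, everything else 0.
  A = 0.0d0
  A(:,1) = 1.0d0
  Anew = A
  error = 0.0d0

  !$acc parallel loop reduction(max:error) copyin(A) copy(Anew)
  do j = 2, m-1
    do i = 2, n-1
      ! A is named once; the single range covers the whole 3x3 neighbourhood.
      !$acc cache(A(i-1:i+1,j-1:j+1))
      Anew(i,j) = 0.25d0 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      error = max( error, abs( Anew(i,j) - A(i,j) ) )
    end do
  end do

  print *, 'max error after one sweep:', error
end program lap2d_cache_sketch
```

With these assumed boundary values, only the column next to the boundary changes in the first sweep, so the reported maximum change should be 0.25. Comparing that value with and without the cache directive is a quick way to check that the directive has not altered the result.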
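Note that the PTX info does not show the shared memory usage here, because it only reports fixed-size shared memory. We size the shared memory dynamically when launching the CUDA kernels. When reviewing the generated CUDA C code (-ta=tesla:nollvm,keep), I see that shared memory references are being generated.
Also note that using shared memory does not guarantee better performance. There is overhead in populating the shared array, and the generated kernel needs to synchronize threads. Unless there is a lot of reuse, "cache" may not be beneficial.
If the PGI compiler can determine that an array is read-only, either through analysis or through the use of "INTENT(IN)", and we are targeting a device with compute capability 3.5 or greater, then we will try to use texture memory. In this case, putting "A" in texture memory may be more beneficial.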
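As a sketch of the INTENT(IN) route, the sweep can be moved into a subroutine whose dummy argument A is declared intent(in), which tells the compiler that the array is not modified there. The module, subroutine, and program names below are hypothetical, as are the array size and boundary values. Whether texture memory is actually used depends on the compiler version and the target device; nothing in this code forces it.

```fortran
module lap2d_sweep_sketch
  implicit none
contains
  ! Hypothetical helper: one Jacobi sweep over the interior points.
  subroutine sweep(A, Anew, n, m, error)
    integer, intent(in)    :: n, m
    real(8), intent(in)    :: A(n,m)     ! read-only inside this routine
    real(8), intent(inout) :: Anew(n,m)  ! interior points are overwritten
    real(8), intent(out)   :: error
    integer :: i, j

    error = 0.0d0
    !$acc parallel loop reduction(max:error) copyin(A) copy(Anew)
    do j = 2, m-1
      do i = 2, n-1
        Anew(i,j) = 0.25d0 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
        error = max( error, abs( Anew(i,j) - A(i,j) ) )
      end do
    end do
  end subroutine sweep
end module lap2d_sweep_sketch

program use_sweep
  use lap2d_sweep_sketch
  implicit none
  integer, parameter :: n = 64, m = 64  ! assumed size for illustration
  real(8) :: A(n,m), Anew(n,m), err

  A = 0.0d0
  A(:,1) = 1.0d0                        ! assumed boundary values
  Anew = A
  call sweep(A, Anew, n, m, err)
  print *, 'max error after one sweep:', err
end program use_sweep
```

The cache directive is left out here on purpose, so that the read-only path can be timed separately from the shared-memory variant.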
Hope this helps, Mat