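CUDA shuffle for double-precision data

Time: 2014-06-07 08:16:41

Tags: cuda shuffle double-precision

My CUDA program needs to perform a reduction on double-precision data. I am following Julien Demouth's slides "Shuffle: Tips and Tricks".

The shuffle function is as follows: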

/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;
    // Split the double number into 2 32b registers.
    int lo, hi;
    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    // Shuffle the two 32b registers.
    lo = __shfl_xor(lo,lane,warpSize);
    hi = __shfl_xor(hi,lane,warpSize);
    // Recreate the 64b number.
    asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
    return x;
}

When I compile the program, I get the following errors.

ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error   : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error   : Arguments mismatch for instruction 'mov'
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [csr_double] error 255

Can anyone offer some suggestions?

2 answers:

Answer 0: (score: 4)

There is a syntax error in the inline assembly instruction that unpacks the double argument into two 32-bit registers. This:

asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

should be:

asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));

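Using "d" (that is, a 64-bit floating-point register) as the source of a 32-bit move is illegal. mov.b32 also makes no sense here: the code has to move 64 bits into two 32-bit registers, which is what mov.b64 with a two-element vector destination does.

Answer 1: (score: 3)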
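For reference, the corrected split-and-recombine pattern can be sketched as a complete, compilable file. This is a minimal sketch under stated assumptions, not the original poster's final code: the names `shfl_xor_double`, `warp_sum_kernel` and `run_warp_sum` are hypothetical, and `__shfl_xor_sync` with a full-warp mask is swapped in for `__shfl_xor` on the assumption of a CUDA 9.0 or later toolkit with all 32 lanes active.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: XOR-shuffle a double by moving its two 32-bit halves.
// Assumes CUDA 9.0+ and that all 32 lanes of the warp execute the call.
__device__ __inline__ double shfl_xor_double(double x, int lane_mask)
{
    int lo, hi;
    // mov.b64 unpacks the 64-bit register into two 32-bit registers.
    asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(x));
    // Shuffle each half separately.
    lo = __shfl_xor_sync(0xffffffffu, lo, lane_mask);
    hi = __shfl_xor_sync(0xffffffffu, hi, lane_mask);
    // mov.b64 packs the two halves back into one 64-bit register.
    asm volatile("mov.b64 %0, {%1,%2};" : "=d"(x) : "r"(lo), "r"(hi));
    return x;
}

// Each lane contributes its lane index; after the butterfly
// exchange every lane holds the sum over the whole warp.
__global__ void warp_sum_kernel(double *out)
{
    double v = static_cast<double>(threadIdx.x);
    for (int m = 16; m > 0; m >>= 1)
        v += shfl_xor_double(v, m);
    if (threadIdx.x == 0)
        *out = v;
}

// Host wrapper: launches a single warp and returns the reduced value.
double run_warp_sum()
{
    double *d_out = nullptr;
    double h_out = 0.0;
    cudaMalloc(&d_out, sizeof(double));
    warp_sum_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("warp sum = %f\n", run_warp_sum());
    return 0;
}
```

With lane values 0 through 31 the reduction should yield 0 + 1 + ... + 31 = 496. Error checking on the CUDA runtime calls is left out to keep the sketch short.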

As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor are deprecated.

The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.

You no longer need to write your own shuffle function for double-precision arithmetic.
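To illustrate, a warp-level sum over doubles can call `__shfl_xor_sync` directly on the double value. The following is a minimal sketch, assuming CUDA 9.0 or later and a full warp of 32 active lanes (hence the mask `0xffffffff`); `warp_reduce_sum`, `reduce_kernel` and `reduce_one_warp` are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Butterfly (XOR) reduction across one warp, shuffling doubles directly.
// Assumes CUDA 9.0+ and that all 32 lanes of the warp are active.
__device__ double warp_reduce_sum(double v)
{
    for (int lane_mask = warpSize / 2; lane_mask > 0; lane_mask >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, lane_mask);
    return v;  // every lane now holds the warp-wide sum
}

// One warp reads 32 inputs and lane 0 writes the reduced result.
__global__ void reduce_kernel(const double *in, double *out)
{
    double v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0)
        *out = v;
}

// Host wrapper: copies 32 doubles to the device, reduces them, returns the sum.
double reduce_one_warp(const double *h_in)
{
    double *d_in = nullptr, *d_out = nullptr;
    double h_out = 0.0;
    cudaMalloc(&d_in, 32 * sizeof(double));
    cudaMalloc(&d_out, sizeof(double));
    cudaMemcpy(d_in, h_in, 32 * sizeof(double), cudaMemcpyHostToDevice);
    reduce_kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    double h_in[32];
    for (int i = 0; i < 32; ++i)
        h_in[i] = 0.5;  // 32 * 0.5 = 16
    printf("sum = %f\n", reduce_one_warp(h_in));
    return 0;
}
```

If only some lanes take part in the shuffle, the mask has to name exactly those lanes. The full mask used here is valid only because the kernel is launched with one complete, non-divergent warp.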