Question

I'm trying to implement my own 64-bit shuffle function in CUDA. However, if I do it like this:

static __inline__ __device__ double __shfl_xor(double var, int laneMask, int width=warpSize)
{
    int hi, lo;
    asm volatile( "mov.b64 { %0, %1 }, %2;" : "=r"(lo), "=r"(hi) : "d"(var) );
    hi = __shfl_xor( hi, laneMask, width );
    lo = __shfl_xor( lo, laneMask, width );
    return __hiloint2double( hi, lo );
}

all subsequent calls to __shfl_xor are instantiated from this 64-bit version, no matter what the type of the argument is. For example, if I do

int a;
a = __shfl_xor( a, 16 );

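it still uses the double version. A workaround might be to use a different function name. But since I call this shuffle function from a template function, using different names means I would have to write a separate version of the template for 64-bit floating point, which is not very neat.

So how can I overload __shfl_xor(double, ...) while still making sure that __shfl_xor(int, ...) gets called when it should?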

Answer 1

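All integral types and float can be implicitly converted to double. When given the choice between the built-in function and your specialized double function, the compiler here may be choosing yours for all types.

Have you tried creating functions under a different name: one specialized double variant that holds your implementation, plus dummy variants for the other types that simply forward to the built-in shuffle?

For example: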

static __inline__ __device__ double foo_shfl_xor(double var, int laneMask, int width=warpSize)
{
    // Your double shuffle implementation
}

static __inline__ __device__ int foo_shfl_xor(int var, int laneMask, int width=warpSize)
{
    // For every non-double data type you use
    // Just call the original shuffle function
    return __shfl_xor(var, laneMask, width);
}

// Your code that uses shuffle
double d;
int a;
foo_shfl_xor(d, ...); // Calls your custom shuffle
foo_shfl_xor(a, ...); // Calls default shuffle
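
Since the question calls the shuffle from a template, it may help to see how such a wrapper overload set resolves when it is called generically. The following is only a host-side C++ sketch, not device code: the enum Picked and the template butterfly_step are hypothetical names, and the two foo_shfl_xor stand-ins only report which overload was selected instead of shuffling anything. The static_asserts are checked at compile time.

```cpp
// Host-only C++ model of the wrapper overload set; no GPU is involved.
// All names here are hypothetical stand-ins, not CUDA API.
enum class Picked { IntVersion, DoubleVersion };

// Stand-in for the wrapper that would forward to the built-in 32-bit shuffle.
constexpr Picked foo_shfl_xor(int /*var*/, int /*laneMask*/) {
    return Picked::IntVersion;
}

// Stand-in for the wrapper that would hold the custom 64-bit shuffle.
constexpr Picked foo_shfl_xor(double /*var*/, int /*laneMask*/) {
    return Picked::DoubleVersion;
}

// Stand-in for the template function from the question: one body for all T.
template <typename T>
constexpr Picked butterfly_step(T value, int laneMask) {
    return foo_shfl_xor(value, laneMask);
}

// int is an exact match for the int overload.
static_assert(butterfly_step(1, 16) == Picked::IntVersion, "int -> int wrapper");
// double is an exact match for the double overload.
static_assert(butterfly_step(1.0, 16) == Picked::DoubleVersion, "double -> double wrapper");
// float promotes to double, so it also lands on the double overload.
static_assert(butterfly_step(1.0f, 16) == Picked::DoubleVersion, "float -> double wrapper");
```

With only these two overloads, a float argument is routed to the double version, and an unsigned or long long argument would be ambiguous, because converting to int and converting to double rank equally. That is why the comment in the answer's code says to add a forwarding wrapper for every non-double type you actually use.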

Overloading the CUDA shuffle function makes the original ones invisible
