Question

I am trying to define a templated CUDA kernel for logical operations on an image. The code looks like this:

#define AND 1
#define OR 2
#define XOR 3
#define SHL  4
#define SHR 5 

template<typename T, int opcode> 
__device__ inline T operation_lb(T a, T b)
{
    switch(opcode)
    {
    case AND:
        return a & b;
    case OR:
        return a | b;
    case XOR:
        return a ^ b;
    case SHL:
        return a << b;
    case SHR:
        return a >> b;
    default:
        return 0;
    }
}

//Logical Operation With A Constant
template<typename T, int channels, int opcode> 
__global__ void kernel_logical_constant(T* src, const T val, T* dst, int width, int height, int pitch)
{
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    if(xIndex >= width || yIndex >= height) return;

    unsigned int tid = yIndex * pitch + (channels * xIndex);

    #pragma unroll
    for(int i=0; i<channels; i++)
        dst[tid + i] = operation_lb<T,opcode>(src[tid + i],val);
}

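The problem is that when I instantiate the kernel for bit shifting, I get the following compilation error:

Error 1 error : Ptx assembly aborted due to errors

The kernel instantiation looks like this: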

template __global__ void kernel_logical_constant<unsigned char,1,SHL>(unsigned char*,unsigned char,unsigned char*,int,int,int);

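There are 19 more instantiations like this one, covering unsigned char, unsigned short, 1 and 3 channels, and all of the logical operations. Only the bit-shift instantiations, i.e. SHL and SHR, cause the error. When I remove them, the code compiles and works perfectly. The code also works if I replace the bit shift with any other operation inside the operation_lb device function. I wonder whether this is related to the amount of PTX code generated by so many different instantiations of the kernel.

I am using CUDA 5.5, Visual Studio 2010, and Windows 8 x64, compiling for compute_1x, sm_1x.

Any help would be appreciated.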

Answer 1

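The original question stated that the poster was using compute_20, sm_20. With those settings I could not reproduce the error using the code posted here. However, the comments pointed out that sm_10 is actually being used. When I switched to compiling for sm_10, I was able to reproduce the error.

It appears to be a bug in the compiler. I say this only because I don't believe the compiler should generate code that the assembler cannot handle. Beyond that, I don't know the underlying root cause. I have filed a bug report with NVIDIA.

In my limited testing, it seems to happen only with unsigned char, not with int.

As a possible workaround, for cc2.0 and newer devices, specify -arch=sm_20 when compiling.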
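
If sm_1x has to stay as the target, one experiment worth trying follows from the observation that the failure shows up with unsigned char but not with int: do the arithmetic in unsigned int explicitly and narrow only the result. The sketch below is a guess, not a confirmed fix. I have not verified that it avoids the ptxas error on CUDA 5.5. The name operation_lb_wide and the HD macro are made up for illustration, and it assumes T is an unsigned integer type no wider than 32 bits with a shift count below 32.

```cpp
// Hypothetical variant of operation_lb that widens to unsigned int first.
// NOT confirmed to avoid the sm_1x ptxas error; it is only an experiment.

#ifdef __CUDACC__
#define HD __host__ __device__   // compiled by nvcc: usable on host and device
#else
#define HD                       // plain C++ compiler: host only
#endif

#define AND 1
#define OR  2
#define XOR 3
#define SHL 4
#define SHR 5

template<typename T, int opcode>
HD inline T operation_lb_wide(T a, T b)
{
    // Assumption: T is unsigned and at most 32 bits wide, and b < 32.
    // A shift count of 32 or more would be undefined behavior.
    const unsigned int wa = a;
    const unsigned int wb = b;

    switch (opcode)
    {
    case AND: return static_cast<T>(wa & wb);
    case OR:  return static_cast<T>(wa | wb);
    case XOR: return static_cast<T>(wa ^ wb);
    case SHL: return static_cast<T>(wa << wb);  // high bits dropped on narrowing
    case SHR: return static_cast<T>(wa >> wb);
    default:  return static_cast<T>(0);
    }
}
```

For unsigned char and unsigned short this should give the same results as the original operation_lb, since the language already promotes those operands before shifting. The kernel would call operation_lb_wide<T,opcode> in place of operation_lb<T,opcode>.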
