Question

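I am trying to implement a generic class in CUDA for common algorithms such as Reduce or Scan, with support for some preprocessing, for example a simple map, as part of the algorithm. The map operation runs before the actual reduce/scan algorithm. To achieve this, I would like to use lambda functions. The following is how I tried to implement it.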

template<typename T> void __device__ ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

The helper function that invokes this kernel is written as follows,

template<typename T, typename LAMBDA>
void Reduce(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}

All of the above code lives in a source file named Reduce.cu, and the corresponding header file, Reduce.h, is as follows

// Reduce.h
template<typename T, typename LAMBDA>
void Reduce(T * d_in, T * d_out, unsigned int size, LAMBDA lam);

So in the end, the complete Reduce.cu looks like this,

// Reduce.cu
template<typename T> void __device__ ReduceOperationPerThread(T * d_in, T * d_out, unsigned int size)
{
    //Actual Reduce Algorithm Comes here 
}

template<typename T, typename LAMBDA> 
__global__ void ReduceWithPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    lam();

    ReduceOperationPerThread(d_in, d_out, size);
}

template<typename T, typename LAMBDA>
void ReduceWPreprocessing(T * d_in, T * d_out, unsigned int size, LAMBDA lam)
{
    // preparing block sizes, grid sizes
    // and additional logic for invoking the kernel goes here
    // with the Kernel invocation as following

    ReduceWithPreprocessing<T><<<gridSize, blockSize>>>(d_in, d_out, size, lam);
}

But the problem I am running into has to do with writing template functions in separate .h and .cu files.

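In the normal case, without lambda functions, what I used to do was add explicit instantiations of the function for every possible template argument at the end of the .cu file, as explained in the FAQ "How can I avoid linker errors with my template classes?"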

// At the end of the Reduce.cu file
// Writing functions with possible template values 
// For A normal Reduce function

template void Reduce<double>(double * d_in, double * d_out, unsigned int size);
template void Reduce<float>(float * d_in, float* d_out, unsigned int size);
template void Reduce<int>(int * d_in, int * d_out, unsigned int size);

But in this case, the possible values of the template parameter LAMBDA cannot be listed in advance.

template void ReduceWPreprocessing<int>(int * d_in, int * d_out, unsigned int size, ??? lambda);
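
To illustrate why the `???` above cannot be filled in: every lambda expression has its own unnamed closure type, so even two textually identical lambdas are different types. The following is a minimal host-side sketch of that point. `ApplyThenAdd` and `ClosureTypesDiffer` are hypothetical names used only for illustration and are not part of the code in the question.

```cuda
#include <type_traits>

// Hypothetical stand-in for a templated entry point such as ReduceWPreprocessing.
template <typename T, typename LAMBDA>
T ApplyThenAdd(T a, T b, LAMBDA lam)
{
    return lam(a) + lam(b);
}

// Returns true when both lambdas compute the same result, even though the
// static_assert below shows that their types are distinct.
inline bool ClosureTypesDiffer()
{
    auto twice_a = [](int x) { return 2 * x; };
    auto twice_b = [](int x) { return 2 * x; };

    // Same source text, but two different unnamed closure types.
    static_assert(!std::is_same<decltype(twice_a), decltype(twice_b)>::value,
                  "each lambda expression has a unique closure type");

    // Each call instantiates ApplyThenAdd with a different LAMBDA argument.
    return ApplyThenAdd(1, 2, twice_a) == ApplyThenAdd(1, 2, twice_b);
}
```

Because the closure type only exists at the call site, it cannot be spelled out in an explicit instantiation list at the end of the .cu file.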

Is there another way to use lambda functions for this kind of application?

Answer 1

[Summarizing the comments into a community wiki answer to get this question off the unanswered queue]

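At the time the question was posted, what it asks for could not be done, because CUDA lacked a placeholder mechanism for capturing lambda expressions.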

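However, CUDA (since version 8, released in the first quarter of 2017) now has a polymorphic function wrapper similar to std::function, called nvfunctional (the header that provides nvstd::function). It lets you define a generic type for lambda expressions that can be used as a template argument during instantiation, then capture a lambda passed as an argument and call it in a generic way.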
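
As a rough sketch of the wrapper described above, the following wraps a capturing lambda in `nvstd::function` from `<nvfunctional>` and uses it as the map step. The names `ApplyMap`, `MapWithWrapper`, and `RunMap` are hypothetical, the reduction itself is left out, and error checking is omitted for brevity. One assumption to flag: as far as I understand the CUDA Programming Guide, an `nvstd::function` object cannot be passed from host code to device code at run time, so this sketch constructs the wrapper inside the kernel rather than receiving it as a kernel argument.

```cuda
#include <nvfunctional>
#include <cuda_runtime.h>
#include <vector>

// Device helper with a fixed, nameable parameter type (no LAMBDA template parameter).
__device__ int ApplyMap(nvstd::function<int(int)>& f, int x)
{
    return f(x);
}

// Hypothetical kernel: the preprocessing step is held in a type-erased wrapper.
__global__ void MapWithWrapper(const int* d_in, int* d_out, unsigned int size, int offset)
{
    // The wrapper is built in device code from a capturing lambda.
    nvstd::function<int(int)> preprocess = [offset](int x) { return x + offset; };

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        d_out[i] = ApplyMap(preprocess, d_in[i]);
    // A reduce/scan over d_out would follow here.
}

// Host helper: copies the input to the device, runs the kernel, copies back.
std::vector<int> RunMap(const std::vector<int>& in, int offset)
{
    const unsigned int n = static_cast<unsigned int>(in.size());
    std::vector<int> out(n);
    if (n == 0) return out;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const unsigned int blockSize = 256;
    const unsigned int gridSize = (n + blockSize - 1) / blockSize;
    MapWithWrapper<<<gridSize, blockSize>>>(d_in, d_out, n, offset);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `RunMap({1, 2, 3}, 10)` on a machine with a CUDA device should add 10 to each element. Because the wrapper's type can be written down, device helpers such as `ApplyMap` do not need a template parameter for the callable. The cost is an indirect call through the wrapper.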

Using lambda functions with template functions in CUDA