Question

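I have some code which uses local memory (I might have used registers, but I need dynamic addressing). Since the amount of memory I use depends on the input and on the number of threads in the block (which also depends on the input, at run time, albeit before launch time), it can't be a fixed-size array. On the other hand, I can't write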

__global__ void foo(short x)
{
    int my_local_mem_array[x];
}

(That is valid but problematic C99, and not valid C++ even on the host side.)

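How can I achieve the same effect? So far my thought has been to template the kernel on the memory size and just invoke it with the maximum possible proper L1 memory on a core, using only as much as I need. But that's kind of ugly, since it would mean I would have to multiply the number of instantiations by the different possible maximum memory sizes. Ugh.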

Answer 1

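I think template metaprogramming is probably the only realistic way to do what you seem to want (the rationale for why you actually want to do this isn't very obvious, but that's another question). I'm not aware of any other way to declare a "variable" length local memory array, given that local memory has to be statically compiled as part of the per-thread stack frame.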

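Of course, instantiating and selecting many different versions of the same template function isn't much fun, but you can use something like Boost.Preprocessor to automate all the tedium.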

For example, consider the following simple kernel, which looks a lot like the model you describe in your question:

#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>

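// N fixes the size of the per-thread scratch array at compile time.
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;

    // Indexing with the runtime value Nout (which must be < N) keeps
    // scratch in local memory rather than in registers.
    if (Nout > 1) {
       out[threadIdx.x] = scratch[Nout];
    }
}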

// Loop predicate: continue while the counter (element 0) has not passed
// the upper bound (element 1).
#define PRED(r, state) \
   BOOST_PP_NOT_EQUAL( \
      BOOST_PP_TUPLE_ELEM(2, 0, state), \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) \
   ) \
   /**/

// Loop step: increment the counter, keep the upper bound.
#define OP(r, state) \
   ( \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), \
      BOOST_PP_TUPLE_ELEM(2, 1, state) \
   ) \
   /**/

// Loop body: emit an explicit instantiation of kernel<n>.
#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state))

BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>

Here I have used BOOST_PP_FOR to generate 11 different instances of the basic kernel (kernel<10> through kernel<20>) automatically:

>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..\boost_1_60_0 template.cu

template.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi13EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi17EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi15EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi19EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi11EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi16EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi20EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi12EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi14EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi18EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi10EEvPii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]

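You can also use the same automation to generate a host wrapper function which selects the correct instance at run time. While it isn't ideal, it is portable, fast, and works fine with the CUDA toolchain.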
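A host wrapper of that kind might look roughly like the sketch below. It is not from the original answer: instead of Boost it walks the compile-time range with a small recursive template, and the names `Dispatch` and `launch_kernel`, the range 10 to 20, and the single-block launch configuration are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Same kernel as above: N fixes the per-thread scratch size at compile time.
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for (int i = 0; i < N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
        out[threadIdx.x] = scratch[Nout];
    }
}

// Hypothetical helper: compares the runtime size n against each compile-time
// value in [Lo, Hi] and launches the matching instantiation.
template<int Lo, int Hi>
struct Dispatch {
    static bool launch(int n, int *d_out, int Nout, int threads)
    {
        if (n == Lo) {
            kernel<Lo><<<1, threads>>>(d_out, Nout);
            return true;
        }
        return Dispatch<Lo + 1, Hi>::launch(n, d_out, Nout, threads);
    }
};

// End of the range: either n equals Hi or no instantiation matches.
template<int Hi>
struct Dispatch<Hi, Hi> {
    static bool launch(int n, int *d_out, int Nout, int threads)
    {
        if (n != Hi)
            return false;
        kernel<Hi><<<1, threads>>>(d_out, Nout);
        return true;
    }
};

// Returns false if n is outside the instantiated range or Nout would index
// past the end of the scratch array; otherwise launches kernel<n>.
bool launch_kernel(int n, int *d_out, int Nout, int threads)
{
    if (Nout < 0 || Nout >= n)
        return false;   // the kernel reads scratch[Nout]
    return Dispatch<10, 20>::launch(n, d_out, Nout, threads);
}
```

A call such as `launch_kernel(16, d_out, 5, 32)` would then run `kernel<16>` on one block of 32 threads, with `d_out` pointing at device memory for at least 32 ints. The chain of comparisons is linear in the number of instantiations, which is negligible next to the cost of a kernel launch for a range this small.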

Answer 2

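Just use global memory allocated by the host with cudaMalloc. Any array access that is not fully defined at compile time will cause CUDA to use "local memory", which, despite the name, resides in the same off-chip device memory as global memory. Alternatively, you can use new or the in-kernel (__device__) malloc.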
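As a rough illustration of the cudaMalloc route, the sketch below gives every thread its own slice of a single buffer whose size is chosen at run time. The kernel name `kernel_global_scratch`, the helper `run_with_scratch`, and the contiguous per-thread layout are assumptions made for this example, not part of the answer.

```cuda
#include <cuda_runtime.h>

// Each thread treats scratch[tid * n] .. scratch[tid * n + n - 1] as its
// private, runtime-sized array.
__global__ void kernel_global_scratch(int *out, int *scratch, int n, int Nout)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *my_scratch = scratch + (size_t)tid * n;

    for (int i = 0; i < n; i++)
        my_scratch[i] = i - Nout;

    if (Nout > 1 && Nout < n) {
        out[tid] = my_scratch[Nout];
    }
}

// Hypothetical host helper: allocates scratch space for all threads,
// launches one block, waits for it, and frees the scratch space again.
cudaError_t run_with_scratch(int *d_out, int threads, int n, int Nout)
{
    int *d_scratch = nullptr;
    cudaError_t err = cudaMalloc(&d_scratch, (size_t)threads * n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    kernel_global_scratch<<<1, threads>>>(d_out, d_scratch, n, Nout);
    err = cudaGetLastError();            // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // errors raised while the kernel ran

    cudaFree(d_scratch);
    return err;
}
```

With this contiguous layout, neighbouring threads touch addresses n ints apart; an interleaved layout (element i of thread t at index i * total_threads + t) would usually coalesce better. The in-kernel alternative replaces the pointer arithmetic with a `malloc(n * sizeof(int))` and matching `free` inside the kernel, in which case the device heap may need to be enlarged with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)` before launch.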

Implementing the equivalent of a variable-length (local) array in CUDA
