Question

Is there any benefit to keeping the number of registers per thread used by a CUDA kernel low?

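I would think there is no advantage (in speed or otherwise). Context switching is just as fast with 3 regs/thread as with 48 regs/thread. It makes no sense not to use all the available registers, unless you simply don't want to. Registers are not shared between kernels. Is this wrong?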

Edit: from the CUDA 4.2 Programming Guide (5.2.3):

    The number of registers used by a kernel can have a significant impact on the number
    of resident warps. For example, for devices of compute capability 1.2, if a kernel uses 16
    registers and each block has 512 threads and requires very little shared memory, then two
    blocks (i.e. 32 warps) can reside on the multiprocessor since they require 2x512x16
    registers, which exactly matches the number of registers available on the multiprocessor.
    But as soon as the kernel uses one more register, only one block (i.e. 16 warps) can be
    resident since two blocks would require 2x512x17 registers, which are more registers than
    are available on the multiprocessor. Therefore, the compiler attempts to minimize register
    usage while keeping register spilling (see Section 5.3.2.2) and the number of instructions
    to a minimum.
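The arithmetic in that passage can be checked with a few lines of host-side C++. This is only a simplified sketch: the function name `blocksLimitedByRegisters` is made up for illustration, the figure of 16384 registers is simply 2x512x16 from the quote, and the model ignores register allocation granularity and every other per-multiprocessor limit.

```cpp
// Simplified model (an assumption, not the exact hardware allocation rule):
// how many blocks fit on one multiprocessor when the register file is the
// only limit.
constexpr int blocksLimitedByRegisters(int regsPerSM, int threadsPerBlock,
                                       int regsPerThread) {
    return regsPerSM / (threadsPerBlock * regsPerThread);
}

// 2 x 512 x 16 = 16384 registers: two blocks fit exactly.
static_assert(blocksLimitedByRegisters(16384, 512, 16) == 2,
              "16 regs/thread: two resident blocks");

// 2 x 512 x 17 = 17408 > 16384: only one block fits.
static_assert(blocksLimitedByRegisters(16384, 512, 17) == 1,
              "17 regs/thread: one resident block");
```

One extra register per thread halves the number of resident blocks in this example, which is the effect the guide describes.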

The "regs/thread" count seems to be unrelated to the total register count.

Answer 1

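Because the total number of registers per multiprocessor is limited, the number of registers a kernel uses affects the occupancy of the GPU.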

See the CUDA Occupancy Calculator.

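You can enter the compute capability, the shared memory size configuration, the number of threads per block, the number of registers per thread, and the number of bytes of shared memory per block.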

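The spreadsheet then tells you how many threads will run on each multiprocessor (MP), the number of active warps, the number of thread blocks per MP, and the occupancy of each MP.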

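In practice it depends on your problem, but you generally want occupancy to be as high as possible so that resources are not wasted. On the other hand, if the number of registers is restricted, the code may become slower.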

So there can be a point in not using all the registers, in order to avoid low occupancy, but as I said, it is a trade-off.
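To make the trade-off concrete, here is a rough sketch of the kind of calculation the spreadsheet performs, reduced to two limits only (registers and resident warps). The names `SmLimits`, `activeWarps` and `occupancy` are hypothetical, and the real calculator also accounts for shared memory, block-count limits and allocation granularity.

```cpp
#include <algorithm>

// Per-multiprocessor limits. Any values plugged in here are assumptions that
// must be looked up for the actual compute capability.
struct SmLimits {
    int registers;  // registers available on one multiprocessor
    int maxWarps;   // maximum number of resident warps on one multiprocessor
};

// Active warps when only the register file and the warp limit are considered.
int activeWarps(SmLimits sm, int threadsPerBlock, int regsPerThread) {
    const int warpSize = 32;
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    int byRegisters = sm.registers / (threadsPerBlock * regsPerThread);
    int byWarps = sm.maxWarps / warpsPerBlock;
    return std::min(byRegisters, byWarps) * warpsPerBlock;
}

// Occupancy = active warps / maximum resident warps.
double occupancy(SmLimits sm, int threadsPerBlock, int regsPerThread) {
    return static_cast<double>(activeWarps(sm, threadsPerBlock, regsPerThread)) /
           sm.maxWarps;
}
```

Assuming 16384 registers and a 32-warp limit with 512-thread blocks, `occupancy` gives 1.0 at 16 regs/thread and 0.5 at 17. At 8 regs/thread it is still 1.0, because the warp limit is already reached: in this model, going below the threshold buys no extra occupancy.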

Answer 2

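Because many blocks can run on a single SM, allocating too many registers per thread can hurt performance. You are bound by the hardware limits of the SM: if your SM would become "saturated" with 10 blocks (i.e. it never has to wait for a block to complete its memory accesses, because it has other work to do), but each block uses 1/5 of the registers on that SM, your utilization will be sub-par.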

The same goes for shared memory, which is limited (IIRC) to about 32 KB per SM (give or take, depending on your GPU/architecture).
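A small sketch of that reasoning: the number of resident blocks is set by whichever resource runs out first. The struct and function names are hypothetical, and the capacities in the note below are illustrative rather than taken from any particular GPU.

```cpp
#include <algorithm>

struct SmCapacity {
    int registers;    // registers available on one SM
    int sharedBytes;  // shared memory available on one SM, in bytes
};

struct BlockUsage {
    int registers;    // registers used by one block (threads x regs/thread), > 0
    int sharedBytes;  // shared memory used by one block, in bytes
};

// Resident blocks per SM: the tightest of the register and shared-memory limits.
int residentBlocks(SmCapacity sm, BlockUsage block) {
    int byRegisters = sm.registers / block.registers;
    // A block that uses no shared memory is not limited by it.
    int byShared = block.sharedBytes > 0 ? sm.sharedBytes / block.sharedBytes
                                         : byRegisters;
    return std::min(byRegisters, byShared);
}
```

With an assumed capacity of 32768 registers and 32 KB of shared memory, a block that takes roughly 1/5 of the registers (6400) and 2 KB of shared memory yields 5 resident blocks, half of the 10 needed to saturate the SM in the example above. Halving the register use to 3200 but taking 8 KB of shared memory yields 4 blocks, so shared memory becomes the limit instead.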
