Question

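I'm experimenting with cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first explicitly converting the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs overlap by 1/4 to 1/8 of the FFT length.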

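cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I set idist = 32768 - 4096. This does work properly, in the sense that it yields the correct output.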
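To make the layout arithmetic concrete, here is a minimal host-side C++ sketch using the sizes quoted in this question. The struct `OverlapLayout` and its member names are hypothetical and exist only for illustration; the only value that corresponds to an actual cufftPlanMany() argument is `idist`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of cuFFT): describes a batch of overlapped FFTs.
struct OverlapLayout {
    std::size_t nfft;     // transform length in samples
    std::size_t overlap;  // samples shared by consecutive transforms
    std::size_t batch;    // number of transforms in the batch

    // Distance between the first samples of consecutive transforms.
    // This is the value that would be passed as idist to cufftPlanMany().
    constexpr std::size_t idist() const { return nfft - overlap; }

    // Index of the first input sample read by transform b.
    constexpr std::size_t first_sample(std::size_t b) const {
        return b * idist();
    }

    // Number of input samples spanned by the whole overlapped batch.
    constexpr std::size_t input_samples() const {
        return batch == 0 ? 0 : (batch - 1) * idist() + nfft;
    }
};

// The configuration from the question: 1024 FFTs of 32768 points, overlap 4096.
constexpr OverlapLayout layout{32768, 4096, 1024};
static_assert(layout.idist() == 28672, "idist = nfft - overlap");
static_assert(layout.input_samples() == 29364224, "span of the overlapped batch");
```

With these numbers the overlapped batch spans 29,364,224 input samples, compared with 1024 * 32768 = 33,554,432 samples for the same batch without overlap.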

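However, I'm seeing a strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways: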

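1. Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as described above. Install a load callback function that only converts from int8_t to float, as needed, at the buffer index provided to the callback.
2. Don't tell cuFFT about the overlapping nature of the input; lie to it and set idist = nfft. Then let the callback function handle the overlap by calculating the correct index that should be read for each FFT input.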

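A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. For brevity, I have not reproduced all of it here. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (a GeForce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result: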

executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec

Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:

Method 1:

template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index, 
    void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}

Method 2:

template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf, 
    size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking 
    // the overlap into account. Convert it to the appropriate sample 
    // index, considering the overlap structure. First, grab the FFT 
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer 
    // arithmetic here for speed, knowing that this would cause a problem 
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap" 
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
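The index arithmetic in the method 2 callback can be checked on the host. The following is a minimal C++ sketch, assuming that in method 2 cuFFT passes the callback an index of the form b * nfft + k (transform b, sample k), and that in method 1 the index it passes is b * idist + k. The function names are hypothetical and are not part of the test program or of cuFFT.

```cpp
#include <cassert>
#include <cstddef>

// Sample index read for position fft_index when cuFFT is told idist = nfft
// and the callback undoes the overlap itself (the arithmetic of method 2).
constexpr std::size_t remap_method2(std::size_t fft_index, std::size_t nfft,
                                    std::size_t overlap) {
    return fft_index - (fft_index / nfft) * overlap;
}

// Sample index for sample k of transform b when the plan itself uses
// idist = nfft - overlap (the addressing assumed for method 1).
constexpr std::size_t address_method1(std::size_t b, std::size_t k,
                                      std::size_t nfft, std::size_t overlap) {
    return b * (nfft - overlap) + k;
}

// Exhaustively compares the two mappings over a batch of transforms.
// Returns true if every (b, k) pair resolves to the same input sample.
inline bool mappings_agree(std::size_t nfft, std::size_t overlap,
                           std::size_t batch) {
    for (std::size_t b = 0; b < batch; ++b) {
        for (std::size_t k = 0; k < nfft; ++k) {
            if (remap_method2(b * nfft + k, nfft, overlap) !=
                address_method1(b, k, nfft, overlap)) {
                return false;
            }
        }
    }
    return true;
}
```

Under those assumptions both methods read exactly the same input samples, so the only differences between them are where the address computation happens and which kernels cuFFT selects.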

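Method 2 has a significantly more complex callback function, one that even involves integer division by a value that is not known at compile time! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing differently when the input overlaps, resulting in the degraded performance?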

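It seems that I should be able to achieve performance much faster than method 2 if the index computation could be removed from the callback (but that would require the overlap to be specified to cuFFT).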

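Edit: After running my test program under nvvp, I can see that cuFFT definitely structures its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this: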

Method 1:

__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec

Method 2:

spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec

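Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs with method 1 (when cuFFT knows about the overlap), but with method 2 (where it doesn't know that the FFTs overlap) it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters in methods 1 and 2.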

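I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should only need a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.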

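Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles properly, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results: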

executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec

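So the performance seems to change depending on whether other cuFFT plans exist when the plan for the test case is created! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels simply all seem to execute faster. I have no reasonable explanation for this effect either.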

Answer 1

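If you specify non-standard strides (it doesn't matter whether it is the batch stride or the transform stride), cuFFT uses a different path internally.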

Regarding edit 2: this is likely GPU Boost adjusting the clocks on the GPU. cuFFT plans do not affect one another.

Ways to get more stable results:

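Run a warm-up kernel (anything that puts the whole GPU to work will do) and then your problem
Increase the batch size
Run the test several times and take the average
Lock the clocks of the GPU (not really possible on GeForce; Tesla cards can do it)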
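The warm-up and averaging suggestions above can be sketched as a small host-side timing helper. This is a minimal C++ sketch, not code from the test program: `mean_msec` is a hypothetical name, and the sketch assumes that the callable passed in blocks until the GPU has finished (for example by ending with cudaDeviceSynchronize()), since wall-clock timing is otherwise meaningless for asynchronous launches.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `work` `warmup` times untimed, then `runs` times
// timed, and returns the mean wall-clock time per timed run in milliseconds.
// Assumption: `work` blocks until the device is idle before returning.
template <typename F>
double mean_msec(F&& work, std::size_t warmup, std::size_t runs) {
    assert(runs > 0);

    // Untimed iterations give the GPU clocks a chance to ramp up.
    for (std::size_t i = 0; i < warmup; ++i) {
        work();
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

A call such as mean_msec(run_method1, 5, 50), where run_method1 is a placeholder for a function that executes one plan and synchronizes, would then report a figure that is less sensitive to clock ramp-up than a single timed run.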

Answer 2

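As suggested by @llukas, I filed a bug report with NVIDIA about this issue (https://partners.nvidia.com/bug/viewbug/1821802 if you are registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and that they plan to improve it eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0 there is no workaround to make cuFFT use a more efficient method with strided inputs.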

Why does cuFFT performance suffer with overlapping inputs?
