Question

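Does anyone know why allocating a vector on the device takes so long on the first run of a program compiled in Debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010), the first run of the Debug build takes over 40 seconds, while the next run (without recompiling) takes about 10 times less. For the Release build, vector allocation on the device takes 1 second.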

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>   // rand
#include <cstdio>    // printf
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>  // std::cin

int main(void) {
    clock_t t; 

    t = clock();
    thrust::host_vector<int> h_vec( 100);
    clock_t dt = clock() - t;
    printf ("allocation on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    dt = clock() - t;
    printf ("initialization on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
    dt = clock() - t;
    printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    d_vec[0] = h_vec[0];
    dt = clock() - t;
    printf ("copy one to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    d_vec = h_vec;
    dt = clock() - t;
    printf ("copy all to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::sort(d_vec.begin(), d_vec.end());
    dt = clock() - t;
    printf ("sort on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    dt = clock() - t;
    printf ("copy to host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    for(int i=0; i<10; i++)
        printf("%d\n", h_vec[i]);
    dt = clock() - t;
    printf ("output - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    std::cin.ignore();
    return 0;
}

Answer 1

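For the most part, timing the first vector instantiation does not measure the cost of allocating and initializing the vector. It measures the overhead associated with the CUDA runtime and driver. I would guess that if you change your code to something like this: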

int main(void) {
    clock_t t; 

    ....

    cudaFree(0); // This forces context establishment and lazy runtime overheads

    t = clock();
    thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
    dt = clock() - t;
    printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);


    .....

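you should see that the measured time to allocate the vector becomes the same between the first and second runs, even though the wall-clock time to run the program still shows a big difference.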
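To see where the time actually goes, the context setup can be timed on its own line. The following is a minimal sketch of that idea, not the original poster's code: the helper `seconds_since` is a hypothetical name introduced here for illustration, and the timing uses the same `clock()` approach as the question.

```cuda
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <ctime>

// Hypothetical helper (not from the original post): seconds elapsed since `start`.
static double seconds_since(clock_t start) {
    return static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
}

int main() {
    // Time context establishment separately, so that it is not
    // attributed to the first device_vector allocation.
    clock_t t = clock();
    cudaError_t err = cudaFree(0);  // forces context establishment
    printf("context setup - %f sec.\n", seconds_since(t));
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already created, this measures the allocation
    // and initialization of the vector itself.
    t = clock();
    thrust::device_vector<int> d_vec(100);
    printf("allocation on device - %f sec.\n", seconds_since(t));

    return 0;
}
```

If the explanation above holds, the large first-run delay should show up on the "context setup" line rather than on the allocation line. Note that on platforms where `clock()` reports CPU time rather than wall-clock time, a wall-clock timer such as `std::chrono::steady_clock` may be the better choice.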

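I don't have a good explanation for why the startup time differs so much between the first and second runs. If I had to hazard a guess, it would be that some driver-level JIT recompilation is performed on the first run, and the driver caches the code for subsequent runs. One thing to check is that you are compiling the code for the correct architecture for your GPU, which would eliminate driver recompilation as a source of the time difference.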
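One way to check which architecture the GPU actually has is to query its compute capability through the runtime API. This is a small sketch, assuming the GPU of interest is device 0:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // major.minor is the compute capability; the code generation setting
    // (compute_XY, sm_XY) should target a version the device supports.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```

If the reported compute capability is newer than the architecture the project is built for, the binary may contain only PTX for an older virtual architecture, and the driver would then have to JIT-compile it at startup.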

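The nvprof utility can give you an API trace and timings. You might want to run it and see where in the API call sequence the time difference arises. It is not beyond the realm of possibility that you are seeing the effects of some sort of driver bug, but it is impossible to say without more information.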

Answer 2

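In my case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010), it looks like the problem was solved by changing the project's CUDA C/C++ / Code Generation option from compute_10,sm_10 to compute_20,sm_20, to match the GPU architecture. So today I'm happy :)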

NVidia CUDA Thrust device vector allocation is too slow
