Question

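I'm writing my first real application in CUDA, and I've reached the point where I want to know how long a kernel takes to execute. However, as the title says, I don't understand why, in an application that runs a kernel several times, the second launch takes so much less time than the first.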

For example, in the following code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <chrono>
#include <iostream>
#include <stdio.h>

void runCuda(unsigned int size);

__global__ void addKernel(const int arraySize)
{
    1 + 1;
}

void doStuff(int arraySize)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    addKernel <<<(arraySize + 31) / 32, 32 >>> (arraySize);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "Duration: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
    cudaDeviceReset();
}

int main()
{
    doStuff(1e6);
    doStuff(1e6);

    return 0;
}

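The kernel does only a trivial addition and is executed one million times (once per thread). The output of the program above typically looks like this: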

Duration: 1072
Duration: 97

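Both numbers vary from run to run, but they always stay around 1000 and 100 (milliseconds). The fact that the same kernel runs so much faster the second time makes no sense to me.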

Answer 1

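There is a one-time overhead when a program launches its first CUDA kernel. When measuring a kernel's run time, you should launch an empty kernel first so that this overhead is not included in the measurement.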
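One way to apply this advice, as a minimal sketch: launch a throwaway kernel once before starting the clock, so the one-time startup cost is not charged to the measured launch. The names warmupKernel and timeLaunchMs are hypothetical, and addKernel is a stand-in for the kernel from the question.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Hypothetical empty kernel, used only to absorb the one-time startup cost.
__global__ void warmupKernel() {}

// Stand-in for the kernel from the question (does no real work).
__global__ void addKernel(const int arraySize) { (void)arraySize; }

// Times one launch of addKernel with the host clock, in milliseconds.
double timeLaunchMs(int arraySize)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    addKernel<<<(arraySize + 31) / 32, 32>>>(arraySize);
    cudaDeviceSynchronize();  // wait for the kernel before stopping the clock
    auto t2 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t2 - t1).count();
}

int main()
{
    // Warm-up: this first launch pays for runtime and context initialization.
    warmupKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    std::printf("Duration: %.3f ms\n", timeLaunchMs(1000000));
    std::printf("Duration: %.3f ms\n", timeLaunchMs(1000000));
    return 0;
}
```

This sketch deliberately leaves out the cudaDeviceReset() call from the question's doStuff, because resetting the device destroys the state that the warm-up launch just created.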

Answer 2

It is probably because the GPU/CPU raises its clock speed once it has work to do. OS scheduling can also interfere, but that is not your main problem here.

Timing code like this usually means averaging over at least several runs and, if you want to do better, excluding outliers.

I'm sure that if you add a few more calls to doStuff, their timings will be much closer to the second run than to the first.
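The averaging advice could be sketched roughly as follows. This is only an illustration: trimmedMean and timeLaunchMs are hypothetical helpers, the run count of 20 and the trim of 2 samples per side are arbitrary choices, and addKernel is a stand-in for the kernel from the question.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-in for the kernel from the question (does no real work).
__global__ void addKernel(const int arraySize) { (void)arraySize; }

// Times one launch of addKernel with the host clock, in milliseconds.
double timeLaunchMs(int arraySize)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    addKernel<<<(arraySize + 31) / 32, 32>>>(arraySize);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t2 - t1).count();
}

// Mean of the samples after dropping the `trim` smallest and `trim` largest.
// Falls back to a plain mean when there are too few samples to trim.
double trimmedMean(std::vector<double> samples, std::size_t trim)
{
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    if (samples.size() <= 2 * trim) trim = 0;
    double sum = 0.0;
    for (std::size_t i = trim; i < samples.size() - trim; ++i) sum += samples[i];
    return sum / static_cast<double>(samples.size() - 2 * trim);
}

int main()
{
    const int runs = 20;  // arbitrary number of repetitions
    std::vector<double> samples;
    for (int i = 0; i < runs; ++i) samples.push_back(timeLaunchMs(1000000));

    std::printf("first run:    %.3f ms\n", samples.front());
    std::printf("trimmed mean: %.3f ms\n", trimmedMean(samples, 2));
    return 0;
}
```

Because the slow first run is one of the largest samples, trimming should remove it from the average without any special-casing.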

Answer 3

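You will find that on the first run almost all of the extra time is spent in the first cudaMalloc(). That call triggers an initialization step in which the runtime identifies the device and sets up its swap and memory state, and this cost can only be partly mitigated.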
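To check this claim on a particular machine, a small sketch can time two identical cudaMalloc() calls back to back; if initialization dominates, the first should be far slower than the second. The helper name timeMallocMs and the 1 MiB allocation size are assumptions made for illustration. No output is shown because the numbers depend on the hardware and driver.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstddef>
#include <cstdio>

// Times a single cudaMalloc of `bytes` bytes with the host clock (ms).
// The allocation is freed outside the timed region.
// Returns -1.0 if the allocation fails.
double timeMallocMs(std::size_t bytes)
{
    void* ptr = nullptr;
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto t2 = std::chrono::high_resolution_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);
    return std::chrono::duration<double, std::milli>(t2 - t1).count();
}

int main()
{
    const std::size_t bytes = 1 << 20;  // 1 MiB, arbitrary
    std::printf("first cudaMalloc:  %.3f ms\n", timeMallocMs(bytes));
    std::printf("second cudaMalloc: %.3f ms\n", timeMallocMs(bytes));
    return 0;
}
```

The question's program never calls cudaMalloc(), so there the same one-time cost would presumably be paid by whichever CUDA call comes first, which is the kernel launch.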

Answer 4

A better way to time kernels can be found in the "CUDA C++ Best Practices Guide", for example the following code:

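cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Record the start event on the default stream (stream 0)
cudaEventRecord( start, 0 );
kernel<<<grid,threads>>> ( d_odata, d_idata, size_x, size_y,
                           NUM_REPS);
// Record the stop event, then block the host until it has completed
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
// Elapsed time between the two events, in milliseconds
cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );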
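The fragment above relies on names (kernel, grid, threads, d_odata and so on) that are defined elsewhere in the guide. A self-contained variant, adapted here to the question's launch configuration, might look like the sketch below; timeKernelWithEventsMs is a hypothetical helper and addKernel is a stand-in for the asker's kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for the kernel from the question (does no real work).
__global__ void addKernel(const int arraySize) { (void)arraySize; }

// Returns the elapsed time in milliseconds for one launch of addKernel,
// measured between two CUDA events recorded on the default stream.
float timeKernelWithEventsMs(int arraySize)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    addKernel<<<(arraySize + 31) / 32, 32>>>(arraySize);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block the host until `stop` has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    timeKernelWithEventsMs(1000000);  // warm-up launch, result discarded
    std::printf("Duration: %.3f ms\n", timeKernelWithEventsMs(1000000));
    std::printf("Duration: %.3f ms\n", timeKernelWithEventsMs(1000000));
    return 0;
}
```

Events are timestamped on the GPU, so this measures the work between the two records rather than host-side wall-clock time.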

Answer 5

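I haven't worked with this setup, but most likely the kernel has to be compiled on the first run. Shaders for GPUs must be compiled at run time, because each GPU design compiles them slightly differently. Otherwise you would have to build a separate executable for every GPU design, with different versions for each OS and for any other factor that affects how the code is compiled (such as the driver version).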
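If run-time compilation is indeed involved, in CUDA's case that would be the driver JIT-compiling embedded PTX when the executable carries no machine code for the installed GPU's architecture. As a rough sketch for investigating this, the code below prints the device's compute capability, which can then be compared with the architecture the program was built for. The helper name computeCapability is hypothetical, and device 0 is assumed to be the GPU in use.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns the compute capability of `device` encoded as major * 10 + minor
// (for example 75 for 7.5), or -1 if the device cannot be queried.
int computeCapability(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.major * 10 + prop.minor;
}

int main()
{
    int cc = computeCapability(0);  // assumption: device 0 is the one in use
    if (cc < 0) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("compute capability: %d.%d\n", cc / 10, cc % 10);
    return 0;
}
```

Building with a matching nvcc architecture option (for instance -arch=sm_75 for a 7.5 device) embeds machine code for that GPU, which should rule run-time compilation in or out as a contributor to the first-launch delay.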

CUDA kernel runs faster the second time it is run - why?
