Question

I wrote an OpenCL program, and I launch my kernels like this:

 for (int i = 0; i < commandQueues.length; i++) { // one iteration per GPU
 clEnqueueNDRangeKernel(commandQueues[i], kernel[i], 1, null,
        global_work_size, local_work_size, 0, new cl_event[]{userEvent}, events[i]);
 clFlush(commandQueues[i]);
 }

 long before = System.nanoTime();

 // Set userEvent to CL_COMPLETE so that all kernels can start executing
 clSetUserEventStatus(userEvent, CL_COMPLETE);

 // Wait until the work is finished on all command queues
 clWaitForEvents(events.length, events);

 long after = System.nanoTime();

 float totalDurationMs = (after - before) / 1e6f;

 ... profile each event with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END ...

The userEvent ensures that the kernels start running at the same time. Source: [Reima's answer] to "How do I know if the kernels are executing concurrently?".

I got this result on a system with two Tesla K20M GPUs:

 Total duration :37.800076ms
 Duration on device 1 of 2: 38.037186
 Duration on device 2 of 2: 37.85744

Can someone explain to me why the profiled start-to-end time on each device is longer than the total duration?

Thanks.

Answer 1

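Please read: Timer Accuracy.

You should not trust the times those system calls give you. They are usually accurate only to about ±1 ms, unless you go down to counting CPU cycles (and that is hard). GPU timing, on the other hand, is very precise (on the order of a few nanoseconds), so use that instead.

Edit: If you want to test it (for fun): enqueue the kernel 1000 times, add up the time of each execution, and then compare the sum with the system time. In that case the sum should never be higher, because relative to the total execution time (about 38 seconds) the timer inaccuracy is much smaller.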
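As a quick sanity check, the figures from the question can be compared against that ±1 ms figure. The sketch below is only arithmetic on the reported numbers; the class and method names are made up for illustration, and the 1 ms tolerance is the rough accuracy quoted above, not a measured property of any particular host timer. The two device durations exceed the host-side total by roughly 0.24 ms and 0.06 ms, both well inside that tolerance.

```java
// Hypothetical helper: compares device-side and host-side durations (in ms).
final class TimerToleranceCheck {

    /** How much the device-side duration exceeds the host-side one, in ms. */
    static double excessMs(double deviceMs, double hostMs) {
        return deviceMs - hostMs;
    }

    /** True if the discrepancy fits inside the assumed host-timer accuracy. */
    static boolean withinTolerance(double deviceMs, double hostMs, double toleranceMs) {
        return Math.abs(deviceMs - hostMs) <= toleranceMs;
    }

    public static void main(String[] args) {
        double hostTotalMs = 37.800076;               // "Total duration" from the question
        double[] deviceMs = {38.037186, 37.85744};    // per-device profiled durations
        double toleranceMs = 1.0;                     // assumed host-timer accuracy (±1 ms)

        for (int i = 0; i < deviceMs.length; i++) {
            System.out.printf("device %d: excess %.6f ms, within tolerance: %b%n",
                    i + 1,
                    excessMs(deviceMs[i], hostTotalMs),
                    withinTolerance(deviceMs[i], hostTotalMs, toleranceMs));
        }
    }
}
```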
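The bookkeeping for that experiment might look like the sketch below. It is a stand-in only: a small CPU loop takes the place of the kernel and every duration comes from System.nanoTime, so it shows the structure of the comparison (sum of per-run durations versus one enclosing wall-clock measurement) rather than a real GPU measurement. In the actual test, each per-run duration would instead be the difference between CL_PROFILING_COMMAND_END and CL_PROFILING_COMMAND_START for that run's event. All class and method names here are hypothetical.

```java
// Hypothetical stand-in for the "enqueue 1000 times and compare" experiment.
final class RepeatedTimingCheck {

    // Written to on every run so the JIT cannot discard the workload.
    static volatile double sink;

    /** Stand-in for one kernel launch; returns its own duration in nanoseconds. */
    static long timedRun() {
        long start = System.nanoTime();
        double acc = 0.0;
        for (int i = 1; i <= 20_000; i++) {
            acc += Math.sqrt(i);
        }
        sink = acc;
        return System.nanoTime() - start;
    }

    /** Returns {sum of per-run durations, enclosing wall-clock duration}, both in ns. */
    static long[] compare(int runs) {
        long wallStart = System.nanoTime();
        long sumNs = 0;
        for (int i = 0; i < runs; i++) {
            sumNs += timedRun();
        }
        long wallNs = System.nanoTime() - wallStart;
        return new long[] {sumNs, wallNs};
    }

    public static void main(String[] args) {
        long[] result = compare(1000);
        System.out.printf("sum of runs: %.3f ms, wall clock: %.3f ms%n",
                result[0] / 1e6, result[1] / 1e6);
    }
}
```

Because every timed run is nested inside the enclosing measurement, the sum cannot exceed the wall-clock total here. With real profiling data the two numbers come from different clocks, which is why the longer run is needed to make the host timer's error small in relative terms.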

OpenCL profiling start-to-end time is longer than the actual duration
