When I run simpleMultiCopy from the SDK (4.0) on a Tesla C2050, I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
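This shows an expected runtime of 2.7 ms, while the overlapped run actually takes 4.3 ms. What exactly causes this discrepancy? (I also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)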
Answer 0 (score: 1)
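The first kernel launch cannot start until the first memcpy has completed, and the last memcpy cannot start until the last kernel launch has completed. So there is an "overhang" at both ends of the pipeline, and it accounts for some of the overhead you are observing. You can shrink the overhang by increasing the number of streams, but the inter-engine synchronization between streams incurs its own overhead.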
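To get a feel for how large the overhang can be, here is a minimal Python sketch of an idealized three-stage pipeline. It rests on several assumptions that are not from the sample itself: the work is split into `n_chunks` equal pieces, the host-to-device copy, the kernel, and the device-to-host copy each run on their own engine, every stage scales linearly with chunk size, and there is no launch or synchronization cost. The function name `pipelined_time` is hypothetical, and the sample may schedule its streams differently.

```python
def pipelined_time(h2d_ms, kernel_ms, d2h_ms, n_chunks):
    """Estimate total time of an idealized copy-in / kernel / copy-out pipeline.

    Assumes n_chunks equal chunks, one engine per stage, and zero
    synchronization overhead (a simplification, not the sample's code).
    """
    # Per-chunk cost of each stage: H2D copy, kernel, D2H copy.
    stage_cost = [h2d_ms / n_chunks, kernel_ms / n_chunks, d2h_ms / n_chunks]
    # Time at which each engine finishes its most recent chunk.
    engine_free = [0.0, 0.0, 0.0]
    for _ in range(n_chunks):
        ready = 0.0  # when this chunk's previous stage finished
        for stage, cost in enumerate(stage_cost):
            # A stage starts once its input is ready AND its engine is idle.
            start = max(ready, engine_free[stage])
            ready = start + cost
            engine_free[stage] = ready
    # The pipeline is done when the last D2H copy finishes.
    return engine_free[2]


# Timings taken from the question's output (in ms).
H2D, KERNEL, D2H = 2.725792, 0.611264, 2.723360
for n in (1, 4, 16, 64):
    print(n, round(pipelined_time(H2D, KERNEL, D2H, n), 3))
```

Under these assumptions, 4 chunks give about 3.56 ms rather than the 2.73 ms limit, so the overhang alone would explain roughly half of the gap to the measured 4.3 ms. More chunks move the estimate toward 2.73 ms, but the model ignores the per-stream synchronization cost that grows at the same time.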
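It is important to note that overlapping compute and transfer does not always benefit a given workload. Beyond the overhead issues described above, the workload itself has to spend roughly equal amounts of time on computation and on data transfer. Due to Amdahl's law, the potential speedup of 2x or 3x falls off as the workload becomes either transfer-bound or compute-bound.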
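The effect of workload balance on the best-case speedup can be illustrated with a short calculation. This is a sketch under the assumption that, with full overlap, the total time is bounded below by the slowest of the three stages. The function name `ideal_overlap_speedup` is hypothetical.

```python
def ideal_overlap_speedup(h2d_ms, kernel_ms, d2h_ms):
    """Upper bound on speedup from overlapping both transfers with compute.

    Assumes perfect overlap: the slowest stage alone determines the
    overlapped time, and there is no overhang or synchronization cost.
    """
    serialized = h2d_ms + kernel_ms + d2h_ms
    overlapped = max(h2d_ms, kernel_ms, d2h_ms)
    return serialized / overlapped


# Perfectly balanced workload: the bound reaches 3x.
print(ideal_overlap_speedup(1.0, 1.0, 1.0))
# Timings from the question: transfers dominate, the kernel is short.
print(ideal_overlap_speedup(2.725792, 0.611264, 2.723360))
# Strongly transfer-bound workload: almost nothing left to gain.
print(ideal_overlap_speedup(100.0, 1.0, 1.0))
```

For the timings in the question the bound is about 2.22x, below 3x because the kernel is much shorter than either copy. The measured speedup (6.11 ms serialized versus 4.31 ms overlapped) is about 1.42x.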