Why is OpenCL memory allocation so slow on (some of) my NVIDIA GPUs?

Asked: 2017-09-15 15:27:58

Tags: opencl gpu gpgpu

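Recently, after rebooting the GPU compute nodes in our lab, the cost of allocating memory through the OpenCL API (with clCreateBuffer()) seems to have gone up. I previously observed that allocation time was nearly constant in the number of bytes allocated, but now there suddenly seems to be a strong correlation. For example, using the benchmark program at http://lpaste.net/raw/356490, I used to get the following results on an NVIDIA K40 GPU: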

1 bytes; average: 382us; min: 364us; max: 501us
2 bytes; average: 376us; min: 364us; max: 478us
4 bytes; average: 372us; min: 364us; max: 474us
8 bytes; average: 375us; min: 364us; max: 405us
16 bytes; average: 373us; min: 364us; max: 407us
32 bytes; average: 375us; min: 365us; max: 404us
64 bytes; average: 375us; min: 364us; max: 585us
128 bytes; average: 376us; min: 367us; max: 396us
256 bytes; average: 376us; min: 364us; max: 395us
512 bytes; average: 373us; min: 364us; max: 425us
1024 bytes; average: 371us; min: 365us; max: 421us
2048 bytes; average: 372us; min: 364us; max: 472us
4096 bytes; average: 372us; min: 365us; max: 411us
8192 bytes; average: 371us; min: 364us; max: 394us
16384 bytes; average: 371us; min: 364us; max: 403us
32768 bytes; average: 374us; min: 365us; max: 554us
65536 bytes; average: 372us; min: 364us; max: 407us
131072 bytes; average: 371us; min: 365us; max: 393us
262144 bytes; average: 372us; min: 364us; max: 394us
524288 bytes; average: 372us; min: 364us; max: 405us
1048576 bytes; average: 373us; min: 364us; max: 482us
2097152 bytes; average: 371us; min: 364us; max: 391us
4194304 bytes; average: 371us; min: 364us; max: 393us
8388608 bytes; average: 380us; min: 363us; max: 487us
16777216 bytes; average: 372us; min: 364us; max: 474us
33554432 bytes; average: 371us; min: 365us; max: 391us
67108864 bytes; average: 373us; min: 349us; max: 593us
134217728 bytes; average: 372us; min: 365us; max: 399us
268435456 bytes; average: 372us; min: 365us; max: 410us
536870912 bytes; average: 376us; min: 364us; max: 473us

But now, with the same GPU and the same CUDA version (8.0), I get the following results:

1 bytes; average: 136us; min: 127us; max: 367us
2 bytes; average: 133us; min: 127us; max: 147us
4 bytes; average: 134us; min: 128us; max: 155us
8 bytes; average: 133us; min: 128us; max: 153us
16 bytes; average: 133us; min: 128us; max: 149us
32 bytes; average: 132us; min: 128us; max: 145us
64 bytes; average: 133us; min: 128us; max: 153us
128 bytes; average: 143us; min: 132us; max: 371us
256 bytes; average: 138us; min: 133us; max: 170us
512 bytes; average: 138us; min: 133us; max: 157us
1024 bytes; average: 140us; min: 133us; max: 164us
2048 bytes; average: 141us; min: 133us; max: 273us
4096 bytes; average: 138us; min: 133us; max: 158us
8192 bytes; average: 138us; min: 132us; max: 155us
16384 bytes; average: 139us; min: 132us; max: 178us
32768 bytes; average: 139us; min: 133us; max: 156us
65536 bytes; average: 139us; min: 133us; max: 173us
131072 bytes; average: 138us; min: 132us; max: 157us
262144 bytes; average: 138us; min: 127us; max: 442us
524288 bytes; average: 134us; min: 127us; max: 279us
1048576 bytes; average: 134us; min: 127us; max: 264us
2097152 bytes; average: 227us; min: 144us; max: 239us
4194304 bytes; average: 424us; min: 214us; max: 436us
8388608 bytes; average: 819us; min: 409us; max: 849us
16777216 bytes; average: 1606us; min: 815us; max: 1625us
33554432 bytes; average: 3181us; min: 1610us; max: 3211us
67108864 bytes; average: 6377us; min: 3239us; max: 6423us
134217728 bytes; average: 12693us; min: 6421us; max: 12772us
268435456 bytes; average: 25333us; min: 12789us; max: 25593us
536870912 bytes; average: 50606us; min: 25512us; max: 50975us
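To put a number on the scaling in the slow results, here is a small sketch that takes the average times for 2 MiB and larger from the list above and computes how many bytes each extra microsecond "buys" between consecutive sizes. The names `slow_k40` and `marginal_rate_gib_per_s` are made up for this illustration. The rate comes out at roughly 10 GiB/s for every doubling, so above about 1 MiB the cost looks linear in the size; what the driver is actually doing per byte is not something these numbers can establish.

```python
# (size in bytes, average allocation time in microseconds), copied from the
# slow K40 results above for sizes of 2 MiB and larger.
slow_k40 = [
    (2097152, 227),
    (4194304, 424),
    (8388608, 819),
    (16777216, 1606),
    (33554432, 3181),
    (67108864, 6377),
    (134217728, 12693),
    (268435456, 25333),
    (536870912, 50606),
]


def marginal_rate_gib_per_s(samples):
    """Rate implied by each step between consecutive sizes, in GiB/s."""
    rates = []
    for (size0, time0), (size1, time1) in zip(samples, samples[1:]):
        extra_bytes = size1 - size0
        extra_seconds = (time1 - time0) * 1e-6  # microseconds to seconds
        rates.append(extra_bytes / extra_seconds / 2**30)
    return rates


rates = marginal_rate_gib_per_s(slow_k40)
for (size, _), rate in zip(slow_k40[1:], rates):
    print(f"up to {size:>10} bytes: {rate:5.2f} GiB/s")
```

A near-constant rate across all steps is what distinguishes a per-byte cost from a fixed overhead that merely got larger.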

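I see similar behaviour on the other GPUs in our lab (GTX 780Tis and a Titan Black), all of which run RHEL 7.4 and CUDA 8. I could accept some relationship between allocation size and time, but these times also seem absurd. 50ms to allocate 500MiB? I also have a home machine running Fedora 26 and CUDA 8 with an NVIDIA GTX 770, on which I get the following results: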

1 bytes; average: 269us; min: 156us; max: 574us
2 bytes; average: 286us; min: 140us; max: 510us
4 bytes; average: 271us; min: 156us; max: 595us
8 bytes; average: 272us; min: 163us; max: 563us
16 bytes; average: 171us; min: 152us; max: 325us
32 bytes; average: 178us; min: 148us; max: 301us
64 bytes; average: 171us; min: 156us; max: 387us
128 bytes; average: 171us; min: 150us; max: 315us
256 bytes; average: 163us; min: 150us; max: 470us
512 bytes; average: 175us; min: 148us; max: 350us
1024 bytes; average: 173us; min: 155us; max: 471us
2048 bytes; average: 172us; min: 151us; max: 286us
4096 bytes; average: 177us; min: 148us; max: 401us
8192 bytes; average: 188us; min: 156us; max: 527us
16384 bytes; average: 177us; min: 147us; max: 407us
32768 bytes; average: 167us; min: 151us; max: 506us
65536 bytes; average: 174us; min: 145us; max: 294us
131072 bytes; average: 166us; min: 150us; max: 406us
262144 bytes; average: 173us; min: 163us; max: 276us
524288 bytes; average: 172us; min: 152us; max: 431us
1048576 bytes; average: 180us; min: 150us; max: 423us
2097152 bytes; average: 170us; min: 150us; max: 391us
4194304 bytes; average: 171us; min: 162us; max: 238us
8388608 bytes; average: 182us; min: 157us; max: 420us
16777216 bytes; average: 167us; min: 159us; max: 225us
33554432 bytes; average: 188us; min: 164us; max: 539us
67108864 bytes; average: 200us; min: 180us; max: 403us
134217728 bytes; average: 269us; min: 234us; max: 478us
268435456 bytes; average: 333us; min: 300us; max: 610us
536870912 bytes; average: 495us; min: 455us; max: 719us

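This seems much more reasonable to me. The change in behaviour does not coincide with any software update that I know of, but it does coincide with all of the machines being rebooted. One possibly significant difference is that my home machine runs kernel version 4.12, while the slow machines in the lab run 3.10.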

Edit: my home machine (which is still fast) runs version 375.66 of the NVIDIA driver, while the lab machines run 384.66.
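For anyone who cannot reach the linked benchmark, the kind of measurement behind these numbers can be sketched roughly as follows. This is not the original program: `time_allocations`, `host_allocate` and `host_release` are hypothetical names, and the stand-in allocator uses host memory so the sketch runs without a GPU. With pyopencl (an assumption, not something the original benchmark is known to use) the two callables would wrap buffer creation and release instead.

```python
import time


def time_allocations(allocate, release, size, runs=100):
    """Time allocate(size) `runs` times; return (average, min, max) in microseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        buf = allocate(size)
        elapsed = time.perf_counter() - start
        release(buf)  # freeing happens outside the timed region
        samples.append(elapsed * 1e6)
    return sum(samples) / len(samples), min(samples), max(samples)


# Stand-in allocator (host memory only), so the sketch runs anywhere.
# Assumption: with pyopencl one might instead pass
#   lambda n: cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=n)
#   lambda b: b.release()
def host_allocate(size):
    return bytearray(size)


def host_release(buf):
    del buf


if __name__ == "__main__":
    for exponent in range(0, 30, 5):
        size = 1 << exponent
        avg, lo, hi = time_allocations(host_allocate, host_release, size, runs=20)
        print(f"{size} bytes; average: {avg:.0f}us; min: {lo:.0f}us; max: {hi:.0f}us")
```

Keeping the release outside the timed region matters here, since otherwise a slow free would be indistinguishable from a slow allocation.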

0 answers:

No answers