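Recently, after rebooting our lab's GPU compute nodes, the cost of allocating memory through the OpenCL API (with clCreateBuffer()) seems to have gone up. I had previously observed allocation time to be almost independent of the number of bytes allocated, but now there suddenly seems to be a strong correlation. For example, using the benchmark program at http://lpaste.net/raw/356490, I used to get the following results on an NVIDIA K40 GPU: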
1 bytes; average: 382us; min: 364us; max: 501us
2 bytes; average: 376us; min: 364us; max: 478us
4 bytes; average: 372us; min: 364us; max: 474us
8 bytes; average: 375us; min: 364us; max: 405us
16 bytes; average: 373us; min: 364us; max: 407us
32 bytes; average: 375us; min: 365us; max: 404us
64 bytes; average: 375us; min: 364us; max: 585us
128 bytes; average: 376us; min: 367us; max: 396us
256 bytes; average: 376us; min: 364us; max: 395us
512 bytes; average: 373us; min: 364us; max: 425us
1024 bytes; average: 371us; min: 365us; max: 421us
2048 bytes; average: 372us; min: 364us; max: 472us
4096 bytes; average: 372us; min: 365us; max: 411us
8192 bytes; average: 371us; min: 364us; max: 394us
16384 bytes; average: 371us; min: 364us; max: 403us
32768 bytes; average: 374us; min: 365us; max: 554us
65536 bytes; average: 372us; min: 364us; max: 407us
131072 bytes; average: 371us; min: 365us; max: 393us
262144 bytes; average: 372us; min: 364us; max: 394us
524288 bytes; average: 372us; min: 364us; max: 405us
1048576 bytes; average: 373us; min: 364us; max: 482us
2097152 bytes; average: 371us; min: 364us; max: 391us
4194304 bytes; average: 371us; min: 364us; max: 393us
8388608 bytes; average: 380us; min: 363us; max: 487us
16777216 bytes; average: 372us; min: 364us; max: 474us
33554432 bytes; average: 371us; min: 365us; max: 391us
67108864 bytes; average: 373us; min: 349us; max: 593us
134217728 bytes; average: 372us; min: 365us; max: 399us
268435456 bytes; average: 372us; min: 365us; max: 410us
536870912 bytes; average: 376us; min: 364us; max: 473us
But now, with the same GPU and the same CUDA version (8.0), I get the following results:
1 bytes; average: 136us; min: 127us; max: 367us
2 bytes; average: 133us; min: 127us; max: 147us
4 bytes; average: 134us; min: 128us; max: 155us
8 bytes; average: 133us; min: 128us; max: 153us
16 bytes; average: 133us; min: 128us; max: 149us
32 bytes; average: 132us; min: 128us; max: 145us
64 bytes; average: 133us; min: 128us; max: 153us
128 bytes; average: 143us; min: 132us; max: 371us
256 bytes; average: 138us; min: 133us; max: 170us
512 bytes; average: 138us; min: 133us; max: 157us
1024 bytes; average: 140us; min: 133us; max: 164us
2048 bytes; average: 141us; min: 133us; max: 273us
4096 bytes; average: 138us; min: 133us; max: 158us
8192 bytes; average: 138us; min: 132us; max: 155us
16384 bytes; average: 139us; min: 132us; max: 178us
32768 bytes; average: 139us; min: 133us; max: 156us
65536 bytes; average: 139us; min: 133us; max: 173us
131072 bytes; average: 138us; min: 132us; max: 157us
262144 bytes; average: 138us; min: 127us; max: 442us
524288 bytes; average: 134us; min: 127us; max: 279us
1048576 bytes; average: 134us; min: 127us; max: 264us
2097152 bytes; average: 227us; min: 144us; max: 239us
4194304 bytes; average: 424us; min: 214us; max: 436us
8388608 bytes; average: 819us; min: 409us; max: 849us
16777216 bytes; average: 1606us; min: 815us; max: 1625us
33554432 bytes; average: 3181us; min: 1610us; max: 3211us
67108864 bytes; average: 6377us; min: 3239us; max: 6423us
134217728 bytes; average: 12693us; min: 6421us; max: 12772us
268435456 bytes; average: 25333us; min: 12789us; max: 25593us
536870912 bytes; average: 50606us; min: 25512us; max: 50975us
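I see similar behaviour on the other GPUs in our lab (GTX 780 Tis and a Titan Black), all of which run RHEL 7.4 and CUDA 8. I could accept some relationship between allocation size and time, but these times also seem absurd. Roughly 50 ms to allocate 512 MiB? I have a home machine running Fedora 26 and CUDA 8 with an NVIDIA GTX 770, where I get the following results: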
1 bytes; average: 269us; min: 156us; max: 574us
2 bytes; average: 286us; min: 140us; max: 510us
4 bytes; average: 271us; min: 156us; max: 595us
8 bytes; average: 272us; min: 163us; max: 563us
16 bytes; average: 171us; min: 152us; max: 325us
32 bytes; average: 178us; min: 148us; max: 301us
64 bytes; average: 171us; min: 156us; max: 387us
128 bytes; average: 171us; min: 150us; max: 315us
256 bytes; average: 163us; min: 150us; max: 470us
512 bytes; average: 175us; min: 148us; max: 350us
1024 bytes; average: 173us; min: 155us; max: 471us
2048 bytes; average: 172us; min: 151us; max: 286us
4096 bytes; average: 177us; min: 148us; max: 401us
8192 bytes; average: 188us; min: 156us; max: 527us
16384 bytes; average: 177us; min: 147us; max: 407us
32768 bytes; average: 167us; min: 151us; max: 506us
65536 bytes; average: 174us; min: 145us; max: 294us
131072 bytes; average: 166us; min: 150us; max: 406us
262144 bytes; average: 173us; min: 163us; max: 276us
524288 bytes; average: 172us; min: 152us; max: 431us
1048576 bytes; average: 180us; min: 150us; max: 423us
2097152 bytes; average: 170us; min: 150us; max: 391us
4194304 bytes; average: 171us; min: 162us; max: 238us
8388608 bytes; average: 182us; min: 157us; max: 420us
16777216 bytes; average: 167us; min: 159us; max: 225us
33554432 bytes; average: 188us; min: 164us; max: 539us
67108864 bytes; average: 200us; min: 180us; max: 403us
134217728 bytes; average: 269us; min: 234us; max: 478us
268435456 bytes; average: 333us; min: 300us; max: 610us
536870912 bytes; average: 495us; min: 455us; max: 719us
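This seems more reasonable to me. The change in behaviour does not coincide with any software update that I know of, but it does coincide with all of the machines having been rebooted. One possibly significant difference is that my home machine runs kernel version 4.12, while the slow machines in the lab run 3.10.
Edit: my home machine (still fast) runs version 375.66 of the NVIDIA driver, while the lab machines run 384.66.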
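To put numbers on the scaling described above, here is a small Python sketch that works only from the averages already listed. It is not part of the benchmark program; the names `lab_slow_avg_us`, `doubling_ratios` and `effective_gib_per_s` are made up for this illustration. It checks how much the average time grows each time the allocation size doubles, and what that implies as an effective rate in GiB/s.

```python
# Average allocation times (microseconds) copied from the slow lab K40 run above,
# keyed by allocation size in bytes. Only the larger sizes are included.
lab_slow_avg_us = {
    1048576: 134,
    2097152: 227,
    4194304: 424,
    8388608: 819,
    16777216: 1606,
    33554432: 3181,
    67108864: 6377,
    134217728: 12693,
    268435456: 25333,
    536870912: 50606,
}

# The 512 MiB average from the home GTX 770 run, for comparison.
home_avg_us_512mib = 495


def doubling_ratios(samples):
    """Map each size to (time at that size) / (time at half that size)."""
    sizes = sorted(samples)
    return {big: samples[big] / samples[small] for small, big in zip(sizes, sizes[1:])}


def effective_gib_per_s(size_bytes, time_us):
    """Bytes allocated per unit of time, expressed in GiB per second."""
    return (size_bytes / 2**30) / (time_us / 1e6)


ratios = doubling_ratios(lab_slow_avg_us)
rate_512mib = effective_gib_per_s(536870912, lab_slow_avg_us[536870912])
slowdown_vs_home = lab_slow_avg_us[536870912] / home_avg_us_512mib

if __name__ == "__main__":
    for size, ratio in ratios.items():
        print(f"{size:>10} bytes: {ratio:.2f}x the time of half the size")
    print(f"effective rate at 512 MiB: {rate_512mib:.2f} GiB/s")
    print(f"lab vs. home at 512 MiB: {slowdown_vs_home:.0f}x slower")
```

From 8 MiB upward, every doubling of the size roughly doubles the average time, so the cost looks linear in the number of bytes at a bit under 10 GiB/s, and the 512 MiB case is about two orders of magnitude slower than on the home machine. That pattern would fit some per-byte work being done at allocation time, but the timings alone cannot say what that work is.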