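I've run into an interesting phenomenon that I can't explain. I haven't found an answer online, since most posts deal with weak scaling and therefore with communication overhead.
Here is a small piece of code that illustrates the problem. I tested it in several languages with similar results, hence the multiple tags.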
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    int main() {
        MPI_Init(NULL, NULL);
        int wsize;
        MPI_Comm_size(MPI_COMM_WORLD, &wsize);
        int wrank;
        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

        clock_t t;
        MPI_Barrier(MPI_COMM_WORLD); // let all ranks start together
        t = clock();                 // CPU time used by this process so far

        int imax = 10000000;
        int jmax = 1000;
        for (int i = 0; i < imax; i++) {
            for (int j = 0; j < jmax; j++) {
                // nothing
            }
        }

        t = clock() - t;
        printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

        MPI_Finalize();
        return 0;
    }
Now, as you can see, the only part that is timed here is the loop. Therefore, with similar CPUs, no hyperthreading and sufficient RAM, increasing the number of CPUs should produce exactly the same time.
However, on my machine, which has 32 cores and 15 GiB of RAM,
    mpirun -np 1 ./test
gives
    proc 0 took 22.262777 seconds.
but
    mpirun -np 20 ./test
gives
    proc 18 took 24.440767 seconds.
    proc 0 took 24.454365 seconds.
    proc 4 took 24.461191 seconds.
    proc 15 took 24.467632 seconds.
    proc 14 took 24.469728 seconds.
    proc 7 took 24.469809 seconds.
    proc 5 took 24.461639 seconds.
    proc 11 took 24.484224 seconds.
    proc 9 took 24.491638 seconds.
    proc 2 took 24.484953 seconds.
    proc 17 took 24.490984 seconds.
    proc 16 took 24.502146 seconds.
    proc 3 took 24.513380 seconds.
    proc 1 took 24.541555 seconds.
    proc 8 took 24.539808 seconds.
    proc 13 took 24.540005 seconds.
    proc 12 took 24.556068 seconds.
    proc 10 took 24.528328 seconds.
    proc 19 took 24.585297 seconds.
    proc 6 took 24.611254 seconds.
with values in between for other numbers of CPUs.
htop also shows an increase in RAM consumption (VIRT is about 100M for 1 core and about 300M for 20 cores), although that might be related to the size of the MPI communicator?
Finally, the effect is definitely related to the size of the problem, so it is not a communication overhead, which would cause a constant delay regardless of the size of the loop. Indeed, decreasing imax to, say, 10 000 makes the wall times similar.
1 core:
    proc 0 took 0.028439 seconds.
20 cores:
I have tried this on several machines with similar results. Maybe we're missing something very simple.
Thanks for the help!
Answer 0 (score: 2)
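This is what you get from processors whose turbo frequency is limited by temperature.
Modern processors are limited by their thermal design power (TDP). While the processor is cold, a single core may speed up to its turbo frequency multiplier. When the processor is hot, or when several cores are non-idle, the cores are slowed down to the guaranteed base speed. The difference between base speed and turbo speed is often around 400 MHz. AVX or FMA3 workloads may slow the cores down even below the base speed.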