Increasing the number of CPUs degrades performance, with constant CPU load and no communication

Date: 2017-08-31 14:01:58

Tags: c++ c fortran mpi

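I've run into an interesting phenomenon that I can't explain. I haven't found the answer online, because most posts deal with weak scaling and therefore with communication overhead.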

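Here is a small piece of code to illustrate the problem. It was tested in different languages with similar results, hence the multiple tags.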

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {

    MPI_Init(NULL,NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;

    // Make all ranks start the timed section together.
    MPI_Barrier(MPI_COMM_WORLD);

    // clock() reports the CPU time used by this process.
    t=clock();

    // Same fixed amount of work on every rank, no communication.
    int imax = 10000000;
    int jmax = 1000;
    for (int i=0; i<imax; i++) {
        for (int j=0; j<jmax; j++) {
            //nothing
        }
    }

    t=clock()-t;

    printf( " proc %d took %f seconds.\n", wrank,(float)t/CLOCKS_PER_SEC );

    MPI_Finalize();

    return 0;

}

Now, as you can see, the only part that is timed here is the loop. Hence, with similar CPUs, no hyperthreading, and sufficient RAM, increasing the number of CPUs should produce exactly the same time.

However, on my machine, which has 32 cores and 15 GiB of RAM,

mpirun -np 1 ./test

gives

 proc 0 took 22.262777 seconds.

but

mpirun -np 20 ./test

gives

 proc 18 took 24.440767 seconds.
 proc 0 took 24.454365 seconds.
 proc 4 took 24.461191 seconds.
 proc 15 took 24.467632 seconds.
 proc 14 took 24.469728 seconds.
 proc 7 took 24.469809 seconds.
 proc 5 took 24.461639 seconds.
 proc 11 took 24.484224 seconds.
 proc 9 took 24.491638 seconds.
 proc 2 took 24.484953 seconds.
 proc 17 took 24.490984 seconds.
 proc 16 took 24.502146 seconds.
 proc 3 took 24.513380 seconds.
 proc 1 took 24.541555 seconds.
 proc 8 took 24.539808 seconds.
 proc 13 took 24.540005 seconds.
 proc 12 took 24.556068 seconds.
 proc 10 took 24.528328 seconds.
 proc 19 took 24.585297 seconds.
 proc 6 took 24.611254 seconds.

and values in between for other numbers of CPUs.

htop also shows an increase in RAM consumption (VIRT is about 100M for 1 core and about 300M for 20 cores), although that may be related to the size of the MPI communicator?

Finally, it is definitely related to the size of the problem (so it is not a communication overhead, which would cause a constant delay regardless of the size of the loop). Indeed, decreasing imax to, say, 10 000 makes the wall times similar.

1 core:

 proc 0 took 0.028439 seconds.

20 cores:

This has been tried on several machines with similar results. Maybe we're missing something very simple.

Thanks for your help!

1 answer:

Answer 0: (score: 2)

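This is a processor whose turbo frequency is bounded by temperature.

Modern processors are limited by their thermal design power (TDP). While the processor is cool, a single core may speed up to its turbo frequency multiplier. When the processor is hot, or when several cores are busy at once, the cores are slowed down to the guaranteed base speed. The difference between base and turbo speeds is often around 400 MHz. AVX or FMA3 code may be slowed down even below the base speed.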