I am trying to profile the performance of an MPI program that adds two vectors on my university's Cray, and when I run it on 96 processors I get a strange result for the MPI_Reduce timing. Each node of the Cray has 24 cores.
Here is the code:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define SIZE 331776 // (24^4) elements in each vector
double get_wall_time(){
struct timeval time;
gettimeofday(&time,NULL);
return (double)time.tv_sec + (double)time.tv_usec/1000000;
}
int main (int argc, char *argv[])
{
int source, numtasks;
int rank;
double time_begin,time_end,totalTime;
long int i,j;
double reduceTime,productTime;
float a[SIZE],b[SIZE]; // declaring vectors to be multiplied
float dot_product = 0; // final result of dot product
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if(SIZE % numtasks == 0){
long int loc_blockSize = SIZE/numtasks; // block size for local buffers
float loc_buff_a[loc_blockSize]; // local buffer a for each processor
float loc_buff_b[loc_blockSize]; // local buffer b for each processor
float loc_dot[loc_blockSize]; // number of local dot products
source = 0; // process 0 where all the data exists
float result[loc_blockSize]; // results on the process 0
if(rank == source){
for(i=0;i<SIZE;i++){ // assigning random values to both a and b vector
a[i] = i;
b[i] = i;
}
}
MPI_Barrier(MPI_COMM_WORLD); // wait till process 0 assigns value to a and b
/* calculating scatter time*/
time_begin = get_wall_time();
MPI_Scatter(&a,loc_blockSize,MPI_FLOAT,&loc_buff_a,loc_blockSize,MPI_FLOAT,source,MPI_COMM_WORLD);
MPI_Scatter(&b,loc_blockSize,MPI_FLOAT,&loc_buff_b,loc_blockSize,MPI_FLOAT,source,MPI_COMM_WORLD);
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and scatter time:\n%f\n",rank,totalTime);
//scatter time ends----------------------------
/* calculating product time*/
time_begin = get_wall_time();
for(i=0; i<loc_blockSize; i++)
{
loc_dot[i]= loc_buff_a[i] * loc_buff_b[i];
/* //remove this commment to see the result of dot product at each processor
printf("rank= %d Local values:a[%d]= %f b[%d]=%f Dot: %f \n",rank,i,loc_buff_a[i],i,loc_buff_b[i],loc_dot[i]);
*/
}
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and product time:\n%f\n",rank,totalTime);
//product time ends----------------------------
/* calculating reduce time*/
time_begin = get_wall_time();
MPI_Reduce(&loc_dot,&result,loc_blockSize,MPI_FLOAT,MPI_SUM,source,MPI_COMM_WORLD);
if(rank == 0){
for(i=0;i<loc_blockSize;i++){
dot_product = dot_product + result[i];
}
printf("The result is: %f\n",dot_product);
}
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and reduction time:\n%f\n",rank,totalTime);
//reduce time ends--------------------
}// check-divisibility ends---------------
else {
printf("The vector size(=%d) is not divisible by number of %d processors\n",SIZE,numtasks);
}
MPI_Finalize();
}
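(Side note: instead of the gettimeofday-based get_wall_time above, MPI has its own portable wall-clock timer, MPI_Wtime, which returns elapsed seconds as a double. A minimal drop-in replacement for the helper, assuming nothing else changes, would be:)

// Alternative timer: MPI_Wtime() returns wall-clock seconds as a double
// and is the usual choice for timing MPI code.
double get_wall_time_mpi(){
    return MPI_Wtime();
}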
Results of running it with different numbers of processors:
Answer (score: 1)
I ran into a similar performance problem while working on a Cray for my research project and discussed it with my advisor. One possible cause of results like this, once you go beyond a certain number of nodes, is an optimization kicking in: it can produce better performance than runs on 2 or 3 nodes, even though you would expect it to be worse.
Try different node counts beyond four nodes (i.e., increase the process count in multiples of 24) and check the results.
You can try MPI_Barrier, but it will not change anything, because the result is printed only after all the processes involved in computing the dot product have finished.
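If you also want to rule out the measurement itself, note that your timed reduce region includes the rank-0 summation loop and printf, and the ranks are not synchronized before time_begin. Here is a rough sketch (reusing the variable names from your code, not tested on your Cray) that times only the MPI_Reduce call: a barrier lines all ranks up first, the rank-0 post-processing is moved outside the timed region, and the per-rank times are reduced to a single maximum so every run reports one comparable number.

double t0, t1, local_time, max_time;

MPI_Barrier(MPI_COMM_WORLD);   // put all ranks at the same starting point before timing
t0 = MPI_Wtime();
MPI_Reduce(loc_dot, result, (int)loc_blockSize, MPI_FLOAT, MPI_SUM, source, MPI_COMM_WORLD);
t1 = MPI_Wtime();
local_time = t1 - t0;

// collapse the per-rank timings into one number: the slowest rank
MPI_Reduce(&local_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, source, MPI_COMM_WORLD);

if(rank == source){
    float dot = 0.0f;
    for(long int k = 0; k < loc_blockSize; k++)   // summation now outside the timed region
        dot += result[k];
    printf("dot product = %f, reduce time (max over ranks) = %f s\n", dot, max_time);
}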