I am trying to profile the performance of an MPI program that adds two vectors on my university's Cray, and when I run it on 96 processors I get a strange result for the MPI_Reduce timing. Each node of the Cray has 24 cores.
Here is the code:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define SIZE 331776 // (24^4) elements in each vector
double get_wall_time(){
struct timeval time;
gettimeofday(&time,NULL);
return (double)time.tv_sec + (double)time.tv_usec/1000000;
}
int main (int argc, char *argv[])
{
int source, numtasks;
int rank;
double time_begin,time_end,totalTime;
long int i,j;
double reduceTime,productTime;
float a[SIZE],b[SIZE]; // declaring vectors to be multiplied
float dot_product = 0; // final result of dot product
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if(SIZE % numtasks == 0){
long int loc_blockSize = SIZE/numtasks; // block size for local buffers
float loc_buff_a[loc_blockSize]; // local buffer a for each processor
float loc_buff_b[loc_blockSize]; // local buffer b for each processor
float loc_dot[loc_blockSize]; // number of local dot products
source = 0; // process 0 where all the data exists
float result[loc_blockSize]; // results on the process 0
if(rank == source){
for(i=0;i<SIZE;i++){ // assigning random values to both a and b vector
a[i] = i;
b[i] = i;
}
}
MPI_Barrier(MPI_COMM_WORLD); // wait till process 0 assigns value to a and b
/* calculating scatter time*/
time_begin = get_wall_time();
MPI_Scatter(&a,loc_blockSize,MPI_FLOAT,&loc_buff_a,loc_blockSize,MPI_FLOAT,source,MPI_COMM_WORLD);
MPI_Scatter(&b,loc_blockSize,MPI_FLOAT,&loc_buff_b,loc_blockSize,MPI_FLOAT,source,MPI_COMM_WORLD);
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and scatter time:\n%f\n",rank,totalTime);
//scatter time ends----------------------------
/* calculating product time*/
time_begin = get_wall_time();
for(i=0; i<loc_blockSize; i++)
{
loc_dot[i]= loc_buff_a[i] * loc_buff_b[i];
/* //remove this commment to see the result of dot product at each processor
printf("rank= %d Local values:a[%d]= %f b[%d]=%f Dot: %f \n",rank,i,loc_buff_a[i],i,loc_buff_b[i],loc_dot[i]);
*/
}
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and product time:\n%f\n",rank,totalTime);
//product time ends----------------------------
/* calculating reduce time*/
time_begin = get_wall_time();
MPI_Reduce(&loc_dot,&result,loc_blockSize,MPI_FLOAT,MPI_SUM,source,MPI_COMM_WORLD);
if(rank == 0){
for(i=0;i<loc_blockSize;i++){
dot_product = dot_product + result[i];
}
printf("The result is: %f\n",dot_product);
}
time_end = get_wall_time();
totalTime = time_end-time_begin;
printf("rank = %d and reduction time:\n%f\n",rank,totalTime);
//reduce time ends--------------------
}// check-divisibility ends---------------
else {
printf("The vector size(=%d) is not divisible by number of %d processors\n",SIZE,numtasks);
}
MPI_Finalize();
}
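(Side note: instead of the gettimeofday-based get_wall_time above, MPI has its own portable wall-clock timer, MPI_Wtime, which returns elapsed seconds as a double. A minimal drop-in replacement for the helper, assuming nothing else changes, would be:)

// Alternative timer: MPI_Wtime() returns wall-clock seconds as a double
// and is the usual choice for timing MPI code.
double get_wall_time_mpi(){
    return MPI_Wtime();
}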
Results of running it with different numbers of processors:
Answer (score: 1)
I ran into a similar performance problem while working on a Cray for my research project and discussed it with my advisor. One possible cause of results like this, once you go beyond a certain number of nodes, is an optimization kicking in: it can produce better performance than runs on 2 or 3 nodes, even though you would expect it to be worse.
Try different node counts beyond four nodes (i.e., increase the process count in multiples of 24) and check the results.
You can try MPI_Barrier, but it will not change anything, because the result is printed only after all the processes involved in computing the dot product have finished.
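If you also want to rule out the measurement itself, note that your timed reduce region includes the rank-0 summation loop and printf, and the ranks are not synchronized before time_begin. Here is a rough sketch (reusing the variable names from your code, not tested on your Cray) that times only the MPI_Reduce call: a barrier lines all ranks up first, the rank-0 post-processing is moved outside the timed region, and the per-rank times are reduced to a single maximum so every run reports one comparable number.

double t0, t1, local_time, max_time;

MPI_Barrier(MPI_COMM_WORLD);   // put all ranks at the same starting point before timing
t0 = MPI_Wtime();
MPI_Reduce(loc_dot, result, (int)loc_blockSize, MPI_FLOAT, MPI_SUM, source, MPI_COMM_WORLD);
t1 = MPI_Wtime();
local_time = t1 - t0;

// collapse the per-rank timings into one number: the slowest rank
MPI_Reduce(&local_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, source, MPI_COMM_WORLD);

if(rank == source){
    float dot = 0.0f;
    for(long int k = 0; k < loc_blockSize; k++)   // summation now outside the timed region
        dot += result[k];
    printf("dot product = %f, reduce time (max over ranks) = %f s\n", dot, max_time);
}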