我正在编写一些代码,这些代码在计算上很昂贵,但是高度可并行化。一旦实现并行化,我打算在HPC上运行它,但是为了将运行时间缩短到一周以内,该问题需要随着处理器数量的增加而得到很好的扩展。
下面是我要实现的目标的一个简单而荒谬的示例,它足够简洁,可以编译并演示我的问题;
CardView
我已经使用四核笔记本电脑对问题进行了编译
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;
double int_theta(double E){
double result = 0;
for (int k = 0; k < 20000; k++)
result += E*k;
return result;
}
int main()
{
int n = 3500000;
int counter = 0;
time_t timer;
int start_time = time(&timer);
int myid, numprocs;
int k;
double integrate, result;
double end = 0.5;
double start = -2.;
double E;
double factor = (end - start)/(n*1.);
integrate = 0;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for (k = myid; k<n+1; k+=numprocs){
E = start + k*(end-start)/n;
if (( k == 0 ) || (k == n))
integrate += 0.5*factor*int_theta(E);
else
integrate += factor*int_theta(E);
counter++;
}
cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;
MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0)
cout<<result<<endl;
MPI_Finalize();
return 0;
}
我得到以下输出;
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 - lmkl_sequential -lmkl_core -lpthread -lm -ldl
在我看来,我不知道的地方一定存在障碍。 2个处理器比3个处理器能获得更好的性能。请提供任何建议吗?谢谢
答案 0 :(得分:0)
If I read the output of lscpu
you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you are having 4 logical CPUs, but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo
you can finde the make and model for the CPU to maybe find out more.
The four logical CPUs might result from hyperthreading, which means that some hardware resources (e.g. the FPU unit, but I am not an expert on this) are shared between two cores. Thus, I would not expect any good parallel scaling beyond two processes.
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores do get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores - At least as long as you do not include the time needed for process startup and the final MPI_Reduce. These will for sure become slower with more processes involved.