Why does MPI message passing seem to speed up?

Date: 2017-06-19 17:30:50

Tags: mpi, benchmarking

I'm learning MPI programming, and I wrote a simple program that passes a message back and forth between two processes. In the message I record the send and receive times (in nanoseconds), and I noticed something strange: the first few times a message is sent/received there is a large latency (tens of microseconds), but after more sends/receives that latency disappears and drops to 1-2 microseconds. Why does this happen?

My program runs on a machine with four cores, and I launch it on two of them. I've put together a minimal example to demonstrate:

vector<size_t> times;
times.reserve(100);
stopwatch s; // records time elapsed since construction
int counter = 0;
if(mpi.world_rank == 0)
{
    // Do this on rank 0
    for(int i = 0; i < 20; ++i)
    {
        ++counter;
        times.push_back(s.age_nano());
        // Send counter (count of 1) to rank 1 with tag 0
        mpi.send(&counter, 1, 1, 0);
        // Receive value (count of 1) from rank 1 with tag 0
        mpi.receive(&counter, 1, 1, 0);
    }
}
else if(mpi.world_rank == 1)
{
    // Otherwise do this on rank 1
    for(int i = 0; i < 20; ++i)
    {
        // Receive value (count of 1) from rank 0 with tag 0
        mpi.receive(&counter, 1, 0, 0);
        ++counter;
        times.push_back(s.age_nano());
        // Send counter (count of 1) to rank 0 with tag 0
        mpi.send(&counter, 1, 0, 0);
    }
}
// Convert absolute timestamps into deltas between consecutive events
// (iterate from the back so each element still holds its original value)
for(size_t i = times.size() - 1; i > 0; --i) times[i] -= times[i-1];
cout << times << " Counter: " << counter << endl;

When I run the program, I get the following output:

[Code]$ mpic++ main.cc && mpirun -n 2 a.out
{116, 32276, 1288, 665, 674, 633, 662, 661, 570, 651, 560, 564, 610, 602, 635, 636, 13511, 3080, 449, 473} Counter: 40
{23839, 9402, 908, 662, 668, 651, 652, 592, 635, 586, 593, 575, 632, 612, 632, 7120, 8585, 1435, 442, 450} Counter: 40

Notice that a few of the early values are much higher than the others, while most of them sit between 500 and 700 nanoseconds. The mpi.send and mpi.receive functions are just very lightweight wrappers around the standard MPI_Send and MPI_Recv functions. Here is the code for the stopwatch class:

struct stopwatch
{
    typedef decltype(std::chrono::high_resolution_clock::now()) time;
    typedef std::chrono::duration<double, std::ratio<1,1>> seconds;
    typedef std::chrono::duration<double, std::milli> milliseconds;
    typedef std::chrono::duration<double, std::micro> microseconds;
    typedef std::chrono::duration<double, std::nano> nanoseconds;
    time _start = std::chrono::high_resolution_clock::now();
    double age_nano()
    {
        // Convert explicitly to nanoseconds so the unit matches the name,
        // regardless of the clock's native tick period
        return nanoseconds(std::chrono::high_resolution_clock::now() - _start).count();
    }
    double age_micro()
    {
        return microseconds(std::chrono::high_resolution_clock::now() - _start).count();
    }
    double age_milli()
    {
        return milliseconds(std::chrono::high_resolution_clock::now() - _start).count();
    }
    double age()
    {
        return seconds(std::chrono::high_resolution_clock::now() - _start).count();
    }
    void reset() { _start = std::chrono::high_resolution_clock::now(); }
};

And here is the code for the wrapper I built around MPI:

#include <mpi.h>
#include <vector>
#include <string>
template<class...> struct get_mpi_type{};
template<class T> struct get_mpi_type<const T>      { static constexpr auto type() { return get_mpi_type<T>::type(); } };
template<> struct get_mpi_type<short>               { static constexpr auto type() { return MPI_SHORT; } };
template<> struct get_mpi_type<int>                 { static constexpr auto type() { return MPI_INT; } };
template<> struct get_mpi_type<long int>            { static constexpr auto type() { return MPI_LONG; } };
template<> struct get_mpi_type<long long int>       { static constexpr auto type() { return MPI_LONG_LONG; } };
template<> struct get_mpi_type<unsigned char>       { static constexpr auto type() { return MPI_UNSIGNED_CHAR; } };
template<> struct get_mpi_type<unsigned short>      { static constexpr auto type() { return MPI_UNSIGNED_SHORT; } };
template<> struct get_mpi_type<unsigned int>        { static constexpr auto type() { return MPI_UNSIGNED; } };
template<> struct get_mpi_type<unsigned long int>   { static constexpr auto type() { return MPI_UNSIGNED_LONG; } };
template<> struct get_mpi_type<unsigned long long int> { static constexpr auto type() { return MPI_UNSIGNED_LONG_LONG; } };
template<> struct get_mpi_type<float>               { static constexpr auto type() { return MPI_FLOAT; } };
template<> struct get_mpi_type<double>              { static constexpr auto type() { return MPI_DOUBLE; } };
template<> struct get_mpi_type<long double>         { static constexpr auto type() { return MPI_LONG_DOUBLE; } };
template<> struct get_mpi_type<char>                { static constexpr auto type() { return MPI_BYTE; } };
struct mpi_thread
{
    int world_rank;
    int world_size;
    mpi_thread()
    {
        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    }
    ~mpi_thread()
    {
        MPI_Finalize();
    }
    template<class T> void send(const T* data, int count, int destination, int tag)
    {
        MPI_Send(data, count, get_mpi_type<T>::type(), destination, tag, MPI_COMM_WORLD);
    }
    template<class T> void send(const std::vector<T>& data, int destination, int tag)
    {
        send(data.data(), data.size(), destination, tag);
    }
    template<class T> void send(const std::basic_string<T>& str, int destination, int tag)
    {
        send(str.data(), str.size(), destination, tag);
    }
    MPI_Status probe(int source, int tag)
    {
        MPI_Status status;
        MPI_Probe(source, tag, MPI_COMM_WORLD, &status);
        return status;
    }
    template<class T> int get_msg_size(MPI_Status& status)
    {
        int num_amnt;
        MPI_Get_count(&status, get_mpi_type<T>::type(), &num_amnt);
        return num_amnt;
    }
    // Take the status by pointer so MPI_STATUS_IGNORE can be passed through
    // directly; dereferencing it to bind a reference default is undefined behavior
    template<class T> void receive(T* data, int count, int source, int tag, MPI_Status* status = MPI_STATUS_IGNORE)
    {
        MPI_Recv(data, count, get_mpi_type<T>::type(), source, tag, MPI_COMM_WORLD, status);
    }
    template<class T> void receive(std::vector<T>& dest, int source, int tag)
    {
        MPI_Status status = probe(source, tag);
        int size = get_msg_size<T>(status);
        dest.clear();
        dest.resize(size);
        receive(&dest[0], size, source, tag, &status);
    }
    template<class T> void receive(std::basic_string<T>& dest, int source, int tag)
    {
        MPI_Status status = probe(source, tag);
        int size = get_msg_size<T>(status);
        dest.clear();
        dest.resize(size);
        receive(&dest[0], size, source, tag, &status);
    }
} mpi;

I've also overloaded the ostream << operator to print out vectors, but that part is very basic.
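For completeness, a minimal sketch of such an overload, assumed rather than taken from the original (which isn't shown), producing the {a, b, c} format seen in the output above:

// Hypothetical stand-in for the vector printer used above
template<class T>
std::ostream& operator<<(std::ostream& os, const std::vector<T>& v)
{
    os << '{';
    for(size_t i = 0; i < v.size(); ++i)
    {
        if(i) os << ", ";
        os << v[i];
    }
    return os << '}';
}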

1 Answer:

Answer (score: 1)

If you want to benchmark MPI, you should use a well-known benchmark suite, such as the OSU micro-benchmarks (from Ohio State University) or the Intel MPI Benchmarks (IMB).
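For example, assuming the OSU micro-benchmarks are built, their point-to-point latency test is run with two ranks much like your own program:

$ mpirun -n 2 ./osu_latency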

Some MPI libraries establish connections "on demand", which means that the first time a message is sent to a given peer, there is extra overhead to set up the connection. There can also be some overhead the first time a given memory region is sent (the memory has to be registered, and that has a cost).

Well-known benchmarks usually run a number of warm-up iterations before taking the real measurements, so that these one-time costs are kept out of the results.
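Applied to your program, that would look something like the following minimal sketch, reusing the mpi wrapper and stopwatch from the question (the warm-up count of 5 is an arbitrary choice):

const int warmup = 5;   // untimed round trips to absorb one-time setup costs
const int iters  = 20;
for(int i = 0; i < warmup + iters; ++i)
{
    if(i == warmup) s.reset();  // start timing only after the warm-up
    if(mpi.world_rank == 0)
    {
        ++counter;
        if(i >= warmup) times.push_back(s.age_nano());
        mpi.send(&counter, 1, 1, 0);
        mpi.receive(&counter, 1, 1, 0);
    }
    else if(mpi.world_rank == 1)
    {
        mpi.receive(&counter, 1, 0, 0);
        ++counter;
        if(i >= warmup) times.push_back(s.age_nano());
        mpi.send(&counter, 1, 0, 0);
    }
}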