Question

Consider the following program, which is supposed to do some silly addition of doubles:

#include <iostream>
#include <vector>

#include <mpi.h>

void add(void* invec, void* inoutvec, int* len, MPI_Datatype*)
{
    double* a = reinterpret_cast <double*> (inoutvec);
    double* b = reinterpret_cast <double*> (invec);

    for (int i = 0; i != *len; ++i)
    {
        a[i] += b[i];
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    std::vector<double> buffer = { 2.0, 3.0 };

    MPI_Op operation;
    MPI_Op_create(add, 1, &operation);

    MPI_Datatype types[1];
    MPI_Aint addresses[1];
    int lengths[1];
    int count = 1;

    MPI_Get_address(buffer.data(), &addresses[0]);
    lengths[0] = buffer.size();
    types[0] = MPI_DOUBLE;

    MPI_Datatype type;
    MPI_Type_create_struct(count, lengths, addresses, types, &type);
    MPI_Type_commit(&type);

    MPI_Allreduce(MPI_IN_PLACE, MPI_BOTTOM, 1, type, operation, MPI_COMM_WORLD);

    MPI_Type_free(&type);
    MPI_Op_free(&operation);
    MPI_Finalize();

    std::cout << buffer[0] << " " << buffer[1] << "\n";
}

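Because this is part of a larger program in which the data I want to send is 1) on the heap and 2) made up of different types, I think I have to use a user-defined type.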

Something must be wrong here, because the program crashes when run with mpirun -n 2 ./a.out. The backtrace from gdb is:

#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:158
#1  0x00007ffff65de460 in non_overlap_copy_content_same_ddt () from /usr/local/lib/libopen-pal.so.6
#2  0x00007ffff180a69b in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#3  0x00007ffff793bb8b in PMPI_Allreduce () from /usr/local/lib/libmpi.so.1
#4  0x00000000004088b6 in main (argc=1, argv=0x7fffffffd708) at mpi_test.cpp:39

Line 39 is the MPI_Allreduce call. This is probably a silly mistake, but after staring at it for hours I still don't see it. Does anyone spot the error? Thanks!

Answer 1

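Edit: There is indeed a bug in how Open MPI handles types with non-zero lower bounds (such as the one you create when using absolute addresses) when performing in-place reductions. It appears to exist in all versions, including the development branch. The status can be tracked via the issue on GitHub.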

Your add operator is wrong because it fails to account for the datatype's lower bound. A proper solution would be something like:

void add(void* invec, void* inoutvec, int* len, MPI_Datatype* datatype)
{
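    // The true lower bound is the offset of the first data byte relative to
    // the buffer pointer; with absolute addresses it is non-zero.
    MPI_Aint lb, extent;
    MPI_Type_get_true_extent(*datatype, &lb, &extent);

    // Shift both buffer pointers by the lower bound to reach the doubles.
    double* a = reinterpret_cast <double*> (reinterpret_cast <char*>(inoutvec) + lb);
    double* b = reinterpret_cast <double*> (reinterpret_cast <char*>(invec) + lb);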

    for (int i = 0; i != *len; ++i)
    {
        a[i] += b[i];
    }
}

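This accesses the data correctly but is still wrong. *len will be 1, because that is what you pass to MPI_Allreduce, but each element consists of two doubles. A correctly written operator will either use the type introspection mechanism to obtain the length of the block of doubles and multiply it by *len, or simply hardcode the vector length as 2: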

for (int i = 0; i < 2*(*len); i++)
{
    a[i] += b[i];
}

Allreduce with user-defined function and MPI_BOTTOM
