Multiplying matrices with MPI (segmentation fault)

Date: 2016-05-27 08:00:23

Tags: matrix segmentation-fault mpi

I recently started learning MPI and, as you may have guessed, I have already run into an error that I cannot solve on my own!

I want to write a program that multiplies two matrices. However, I haven't gotten that far yet; in fact, I'm stuck right at the start, broadcasting the matrices. Here is the relevant part of my main program:

#define MASTER 0

if (rank == MASTER) {
    /* Only the master allocates and fills the input matrices. */
    A = (double *) malloc(N * N * sizeof(double));
    B = (double *) malloc(N * N * sizeof(double));
    matFillRand(N, A);
    matFillRand(N, B);
}

if (rank == MASTER) {
    /* Result matrix, also allocated only on the master. */
    P = (double *) malloc(N * N * sizeof(double));
}

matMulMPI(N, A, B, P);

if (rank == MASTER) {
    printMatrix(N, P);
}

The function that (in theory) does the math looks like this:

void matMulMPI(long N, double *a, double *b, double *c) {
    long i, j, k;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Broadcast the matrix size first, then matrix B itself. */
    MPI_Bcast(&N, 1, MPI_LONG, MASTER, MPI_COMM_WORLD);

    MPI_Bcast(b, N*N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);

    printMatrix(N, b);

    //TO-DO: Broadcast A
    //TO-DO: Do Math
}

The broadcast doesn't work. I get the following message:

  

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Invalid permissions (2)
Failing at address: 0x401560
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fc3ede6b340]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x981c0) [0x7fc3edb2e1c0]
[ 2] /usr/lib/libmpi.so.1(opal_convertor_unpack+0x105) [0x7fc3ee1788d5]
[ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x460) [0x7fc3e6587630]
[ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x487) [0x7fc3e572a137]
[ 5] /usr/lib/libmpi.so.1(opal_progress+0x5a) [0x7fc3ee1849ea]
[ 6] /usr/lib/libmpi.so.1(ompi_request_default_wait+0x16d) [0x7fc3ee0d1c0d]
[ 7] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x49e) [0x7fc3e486da9e]
[ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xb7) [0x7fc3e486df27]
[ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fc3e486573c]
[10] /usr/lib/openmpi/lib/openmpi/mca_coll_sync.so(mca_coll_sync_bcast+0x64) [0x7fc3e4a7d6a4]
[11] /usr/lib/libmpi.so.1(MPI_Bcast+0x13d) [0x7fc3ee0df78d]
[12] ./matMul() [0x4011a9]
[13] ./matMul() [0x401458]
[14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc3edab7ec5]
[15] ./matMul() [0x400b49]
*** End of error message ***

(A second process received the same signal and printed an equivalent backtrace.)

--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 12466 on node rtidev5.etf.bg.ac.rs exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

1 Answer:

Answer 0 (score: 2):

I figured it out: every process (not just the master) needs its buffer allocated before the broadcast, because MPI_Bcast writes into that buffer on the receiving ranks.

So the missing piece in matMulMPI is the allocation on the non-master ranks:

void matMulMPI(long N, double *a, double *b, double *c) {
    ...

    MPI_Bcast(&N, 1, MPI_LONG, MASTER, MPI_COMM_WORLD);

    /* The master already allocated and filled b in main; every other rank
       still needs a buffer of its own to receive the broadcast into. */
    if (rank != MASTER)
        b = (double *) malloc(N * N * sizeof(double));

    MPI_Bcast(b, N*N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);

    ...
}
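
For reference, here is a minimal, self-contained sketch of that pattern under a few assumptions (the fixed size N = 4 and the local matFillRand helper are made up for illustration, not taken from the original code): every rank allocates its buffers before calling MPI_Bcast, and only the root fills them with data.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define MASTER 0

/* Illustrative helper (assumption): fill an N x N matrix with random values. */
static void matFillRand(long N, double *m) {
    for (long i = 0; i < N * N; i++)
        m[i] = (double) rand() / RAND_MAX;
}

int main(int argc, char *argv[]) {
    int rank, size;
    long N = 4;                 /* assumed fixed size for this sketch */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every rank allocates its buffers BEFORE the broadcast;
       MPI_Bcast writes into them on the non-root ranks. */
    double *A = malloc(N * N * sizeof(double));
    double *B = malloc(N * N * sizeof(double));

    if (rank == MASTER) {       /* only the root provides the data */
        matFillRand(N, A);
        matFillRand(N, B);
    }

    MPI_Bcast(A, N * N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);

    /* Each rank now holds full copies of A and B and could compute
       its share of the product here. */
    printf("rank %d of %d: A[0] = %f, B[0] = %f\n", rank, size, A[0], B[0]);

    free(A);
    free(B);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with, for example, mpirun -np 4 ./bcast_sketch, every rank prints the same first elements, confirming that the broadcast now lands in properly allocated memory instead of the unallocated pointers that caused the original segmentation fault.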