How can I use MPI_Win_allocate_shared without getting errors?

Asked: 2019-03-20 11:28:01

Tags: c++ mpi shared-memory

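I have to reproduce an algorithm that needs two buffers (matrices of size (L + 1) x (N + 2)), and these buffers must be shared among the processes: every process must be able to write into them and to read what the other processes have written. I found that MPI_Win_allocate_shared could be a solution, but I don't think I fully understand how to use it, because I am getting errors. Below is the code with the two trials that I believe are closest to a solution (I have left out most of the algorithm to focus on the problem):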

#include "Options.h"
#include <algorithm>
#include <array>
#include <cmath>
#include <iostream>
#include <memory>
#include <utility>
#include <mpi.h>

std::pair <double, double> Options::BinomialPriceAmericanPut(void) {

int rank,size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

// shared buffers to save data for seller and buyer
MPI_Win win_seller, win_buyer;
// size of the local window in bytes
MPI_Aint buff_size;

///////////////// TRIAL 1 /////////////////////////
// pointers that will (locally) point to the shared memory
typedef std::array<PWL, N+2> row_type;
row_type *seller_buff;
row_type *buyer_buff;

///////////////// TRIAL 2 /////////////////////////
// pointers that will (locally) point to the shared memory
typedef std::array<PWL, N+2> row_type;
row_type seller_buff[L+1];
row_type buyer_buff[L+1];
// with this TRIAL 2 I'll remove "&"in front of seller_buff and buyer_buff 
// in MPI_Win_allocate_shared and MPI_Win_shared_query

// allocate shared memory
if (rank == 0) {
    buff_size = (N+2) * (L+1) * sizeof(PWL);
    MPI_Win_allocate_shared(buff_size, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &seller_buff, &win_seller);
    MPI_Win_allocate_shared(buff_size, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &buyer_buff, &win_buyer);
}
else {
    int disp_unit;
    MPI_Win_allocate_shared(0, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &seller_buff, &win_seller);
    MPI_Win_allocate_shared(0, sizeof(PWL), MPI_INFO_NULL,
                          MPI_COMM_WORLD, &buyer_buff, &win_buyer);
    MPI_Win_shared_query(win_seller, 0, &buff_size, &disp_unit,
                         &seller_buff);
    MPI_Win_shared_query(win_buyer, 0, &buff_size, &disp_unit,
                         &buyer_buff);
}

// up- and down- move factors
double u = exp( sigma * sqrt(expiry/N) );

// cash accumulation factor
double r = exp( R*expiry / N );

// initialize algorithm
int p(size);
int n = N + 2; // number of nodes in the current base level
int s = rank * ( n/p );
int e = (rank==p-1)? n: (rank+1) * ( n/p );
// each core works on e-s nodes in the current level

// compute u and z for both seller and buyer: payoff (0,0) at time N+1
for (int l=s; l<e; l++) {
    const double St = S0*pow (u, N+1-2*l);
    const double Sa = St * (1+k);
    const double Sb = St * (1-k);

    // compute functions
    PWL u_s( {Line(-Sa, 0), Line(-Sb,0)} );
    PWL u_b( {Line(-Sa, 0), Line(-Sb, 0)} );

    // fill buffers
    seller_buff[0][l] = u_s;
    buyer_buff[0][l] = u_b;
}

MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0 ) {
    std::cout << "Row: " << 11 << std::endl
    << "\tAsk = " << seller_buff[0][7].valueInPoint(0) << std::endl
    << "\tBid = " << -buyer_buff[0][7].valueInPoint(0) << std::endl;
}

int U = 0; // variable for the mapping from tree to buffers
int B=N+1; // current base level
while ( B>0 ) {
  // DO stuffs with buffers
}

// compute ask and bid prices
double ask(0), bid(0);

// clear shared windows
MPI_Win_free(&win_seller);
MPI_Win_free(&win_buyer);

return std::make_pair(bid, ask);
}
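
To check which rank is responsible for a given column, the block partition from the code above (the `s` and `e` expressions) can be isolated into small helpers. This is only a sketch: `column_range` and `owner_of` are hypothetical names that do not appear in my code, and both assume 1 <= p <= n.

```cpp
#include <utility>

// Hypothetical helper (not part of the original code): returns the half-open
// column range [s, e) handled by `rank`, using the same formula as the
// `s` and `e` variables in BinomialPriceAmericanPut. Assumes 1 <= p <= n.
std::pair<int, int> column_range(int rank, int p, int n) {
    const int chunk = n / p;
    const int s = rank * chunk;
    // The last rank also takes the remainder when n is not divisible by p.
    const int e = (rank == p - 1) ? n : (rank + 1) * chunk;
    return std::make_pair(s, e);
}

// Hypothetical inverse: which rank writes the given column?
int owner_of(int column, int p, int n) {
    const int chunk = n / p;
    const int r = column / chunk;
    return (r < p - 1) ? r : p - 1;
}
```

With N = 10 (so n = N + 2 = 12) and 3 processes, `column_range(1, 3, 12)` gives [4, 8), so column 7 is written by rank 1.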

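I added the "if" after MPI_Barrier to check whether the buffers work; column 7 (with N = 10) should be computed by rank 1. In fact, TRIAL 1 worked when I used another, simpler class, but it does not work with the PWL class. The errors in the two trials are: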

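1) In TRIAL 1, I get a segmentation fault caused by the call to valueInPoint() inside the if: the problem is that rank 0 cannot see what rank 1 wrote in its columns, but I don't understand why.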

mpiexec -np 3 main
[localhost:09623] *** Process received signal ***
[localhost:09623] Signal: Segmentation fault (11)
[localhost:09623] Signal code: Address not mapped (1)
[localhost:09623] Failing at address: 0x26fd440
[localhost:09623] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7fa78a99de20]
[localhost:09623] [ 1] main[0x4048ac]
[localhost:09623] [ 2] main[0x4048f8]
[localhost:09623] [ 3] main[0x401e55]
[localhost:09623] [ 4] main[0x40178e]
[localhost:09623] [ 5] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7fa78a60c6b0]
[localhost:09623] [ 6] main[0x401389]
[localhost:09623] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 9623 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 139
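
One possible explanation for this crash, which I have not confirmed: if PWL keeps its Line objects in a heap-allocating container such as std::vector (an assumption, since the class definition is not shown here), then only the container's small header ends up in the shared window, while the elements stay in the private heap of the rank that wrote them. Another rank would then follow a pointer that is not mapped in its own address space, which would match "Address not mapped" and would also fit the fact that a simpler class worked. The sketch below uses std::is_trivially_copyable as a rough compile-time check; `FlatLine`, `FlatPWL`, `HeapPWL`, and `storable_in_shared_window` are hypothetical names used only for illustration.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for illustration; the real Line and PWL may differ.
struct FlatLine {
    double slope;
    double intercept;
};

// Fixed-capacity variant: all data is stored inline, with no pointers
// into heap memory.
struct FlatPWL {
    int count;
    FlatLine lines[4];
};

// Heap-backed variant: the vector object only holds pointers to elements
// that live in the private heap of the process that created them.
struct HeapPWL {
    std::vector<FlatLine> lines;
};

// Rough check only: being trivially copyable is necessary for storing a type
// byte-for-byte in shared memory, but not sufficient (a struct holding a raw
// pointer would pass this test and still be unsafe to share).
template <typename T>
constexpr bool storable_in_shared_window() {
    return std::is_trivially_copyable<T>::value;
}

static_assert(storable_in_shared_window<FlatPWL>(),
              "inline data can be placed in a shared window");
static_assert(!storable_in_shared_window<HeapPWL>(),
              "heap-owning types are rejected by the check");
```

If this guess is right, a fixed-size layout along the lines of `FlatPWL` would be one way to make the shared buffers readable from every rank.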

2) In TRIAL 2, rank 0 can access and print what rank 1 computed, but I get another error.

mpiexec -np 3 main
Row: 11
    Ask = 0
    Bid = -0
[localhost:09651] *** Process received signal ***
[localhost:09651] Signal: Segmentation fault (11)
[localhost:09651] Signal code: Address not mapped (1)
[localhost:09651] Failing at address: 0x7fe9777c90bc
[localhost:09651] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7fe976627e20]
[localhost:09651] [ 1] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(cfree+0x14)[0x7fe9762eef74]
[localhost:09651] [ 2] main[0x404724]
[localhost:09651] [ 3] main[0x404096]
[localhost:09651] [ 4] main[0x40362e]
[localhost:09651] [ 5] main[0x403127]
[localhost:09651] [ 6] main[0x40274f]
[localhost:09651] [ 7] main[0x402528]
[localhost:09651] [ 8] main[0x4025e6]
[localhost:09651] [ 9] main[0x402017]
[localhost:09651] [10] main[0x40178e]
[localhost:09651] [11] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7fe9762966b0]
[localhost:09651] [12] main[0x401389]
[localhost:09651] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 9651 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 139
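
A possible reason why TRIAL 2 corrupts memory (again a guess on my part): the baseptr argument of MPI_Win_allocate_shared and MPI_Win_shared_query is expected to be the address of a pointer variable, into which the library stores the address of the segment. If an array is passed instead, it decays to the address of its first element, and the pointer value is written over the first bytes of that element. The sketch below imitates this convention without MPI; `fake_allocate`, `g_segment`, and the two wrapper functions are hypothetical stand-ins, not MPI functions.

```cpp
#include <cstring>

// Hypothetical stand-in for the shared segment that MPI would map.
double g_segment[8];

// Imitates the baseptr convention of the MPI C binding: the parameter is
// declared as void*, but it is treated as the address of a pointer variable,
// and the segment address is stored there.
void fake_allocate(void* baseptr) {
    void* segment = g_segment;
    std::memcpy(baseptr, &segment, sizeof segment);
}

// Usage as in TRIAL 1: pass the address of a pointer; the pointer then
// refers to the segment.
double* allocate_via_pointer() {
    double* buf = nullptr;
    fake_allocate(&buf);
    return buf;
}

// Usage as in TRIAL 2: pass a local array. The array is not redirected to the
// segment; instead its first bytes are overwritten with the segment address.
// Returns true if those bytes now equal the pointer value.
bool array_gets_clobbered() {
    unsigned char local[2 * sizeof(void*)] = {};
    fake_allocate(local);
    void* expected = g_segment;
    return std::memcmp(local, &expected, sizeof expected) == 0;
}
```

In TRIAL 2 the overwritten bytes would belong to the first PWL of seller_buff and buyer_buff, which might explain why free() later sees an invalid pointer.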

Moreover, when I run the whole algorithm with TRIAL 2 (with the while loop uncommented), I get yet another error:

mpiexec -np 3 main
Row: 11
    Ask = 0
    Bid = -0
*** Error in `main': free(): invalid pointer: 0x00007f2ccac660c4 ***
======= Backtrace: =========
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x6f2e4)[0x7f2cc977f2e4]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x74d16)[0x7f2cc9784d16]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x754fe)[0x7f2cc97854fe]
main[0x405dd6]
main[0x40556c]
main[0x404930]
main[0x404429]
main[0x40396f]
main[0x404eb7]
main[0x404334]
main[0x403833]
main[0x4023ee]
main[0x40178e]
/u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7f2cc97306b0]
main[0x401389]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
0060d000-0060e000 r--p 0000d000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
0060e000-0060f000 rw-p 0000e000 00:25 659                                /vagrant/Google Drive/ACP/TENTATIVO4/main
00fac000-01239000 rw-p 00000000 00:00 0                                  [heap]
7f2cb0000000-7f2cb0021000 rw-p 00000000 00:00 0 
7f2cb0021000-7f2cb4000000 ---p 00000000 00:00 0 
7f2cb7fff000-7f2cc0000000 rw-s 00000000 fd:00 202783851                  /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/shared_mem_pool.localhost (deleted)
7f2cc0000000-7f2cc0021000 rw-p 00000000 00:00 0 
7f2cc0021000-7f2cc4000000 ---p 00000000 00:00 0 
7f2cc48f1000-7f2cc4cf2000 rw-s 00000000 fd:00 135477240                  /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/2/vader_segment.localhost.2
7f2cc4cf2000-7f2cc50f3000 rw-s 00000000 fd:00 68300033                   /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/1/vader_segment.localhost.1
7f2cc50f3000-7f2cc54f4000 rw-s 00000000 fd:00 1474379                    /tmp/openmpi-sessions-vagrant@localhost_0/63096/1/0/vader_segment.localhost.0
7f2cc54f4000-7f2cc54ff000 r-xp 00000000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc54ff000-7f2cc56fe000 ---p 0000b000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc56fe000-7f2cc56ff000 r--p 0000a000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc56ff000-7f2cc5700000 rw-p 0000b000 fd:00 2626640                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libnss_files-2.23.so
7f2cc5700000-7f2cc5706000 rw-p 00000000 00:00 0 
7f2cc5706000-7f2cc5707000 ---p 00000000 00:00 0 
7f2cc5707000-7f2cc5f07000 rw-p 00000000 00:00 0                          [stack:9979]
7f2cc5f07000-7f2cc5f2b000 r-xp 00000000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc5f2b000-7f2cc612b000 ---p 00024000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612b000-7f2cc612c000 r--p 00024000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612c000-7f2cc612d000 rw-p 00025000 fd:00 4240669                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/liblzma.so.5.2.2
7f2cc612d000-7f2cc6142000 r-xp 00000000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6142000-7f2cc6341000 ---p 00015000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6341000-7f2cc6342000 r--p 00014000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6342000-7f2cc6343000 rw-p 00015000 fd:00 1363668                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libz.so.1.2.8
7f2cc6343000-7f2cc7bbf000 r--p 00000000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7bbf000-7f2cc7dbe000 ---p 0187c000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7dbe000-7f2cc7dbf000 r--p 0187b000 fd:00 1549735                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicudata.so.57.1
7f2cc7dbf000-7f2cc7f4d000 r-xp 00000000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc7f4d000-7f2cc814d000 ---p 0018e000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc814d000-7f2cc815f000 r--p 0018e000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc815f000-7f2cc8160000 rw-p 001a0000 fd:00 1549736                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicuuc.so.57.1
7f2cc8160000-7f2cc8162000 rw-p 00000000 00:00 0 
7f2cc8162000-7f2cc83c3000 r-xp 00000000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc83c3000-7f2cc85c3000 ---p 00261000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85c3000-7f2cc85d0000 r--p 00261000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85d0000-7f2cc85d2000 rw-p 0026e000 fd:00 1549762                    /u/sw/pkgs/toolchains/gcc-glibc/5/base/lib/libicui18n.so.57.1
7f2cc85d2000-7f2cc85d3000 rw-p 00000000 00:00 0 
7f2cc85d3000-7f2cc85d5000 r-xp 00000000 fd:00 2590182                    /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libdl-2.23.so
[localhost:09977] *** Process received signal ***
[localhost:09977] Signal: Aborted (6)
[localhost:09977] Signal code:  (-6)
[localhost:09977] [ 0] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libpthread.so.0(+0x10e20)[0x7f2cc9ac1e20]
[localhost:09977] [ 1] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(gsignal+0x38)[0x7f2cc9743228]
[localhost:09977] [ 2] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(abort+0x16a)[0x7f2cc97446aa]
[localhost:09977] [ 3] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x6f2e9)[0x7f2cc977f2e9]
[localhost:09977] [ 4] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x74d16)[0x7f2cc9784d16]
[localhost:09977] [ 5] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(+0x754fe)[0x7f2cc97854fe]
[localhost:09977] [ 6] main[0x405dd6]
[localhost:09977] [ 7] main[0x40556c]
[localhost:09977] [ 8] main[0x404930]
[localhost:09977] [ 9] main[0x404429]
[localhost:09977] [10] main[0x40396f]
[localhost:09977] [11] main[0x404eb7]
[localhost:09977] [12] main[0x404334]
[localhost:09977] [13] main[0x403833]
[localhost:09977] [14] main[0x4023ee]
[localhost:09977] [15] main[0x40178e]
[localhost:09977] [16] /u/sw/pkgs/toolchains/gcc-glibc/5/prefix/lib/libc.so.6(__libc_start_main+0xf0)[0x7f2cc97306b0]
[localhost:09977] [17] main[0x401389]
[localhost:09977] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 9977 on node localhost exited on signal 6 (Aborted).
--------------------------------------------------------------------------
make: *** [Makefile:26: run] Error 134

Can someone please help me understand what is happening and how to fix it? Thank you all.

0 answers:

No answers