MPI_Allreduce() of a matrix does not contain all the values

Asked: 2019-11-21 22:44:29

Tags: c++ c parallel-processing mpi

I have a matrix of doubles, and at every iteration I need to compute new values; each value depends on its neighbors. To speed up the computation I use MPI. I split the matrix between the processes according to its dimensions and the number of processes. Each process then computes its own part of the matrix and writes its values into another matrix of the same size that was previously filled with zeros (so every entry is zero except the values that process had to compute). At the end of each iteration I want to sum all these matrices with MPI_Allreduce(), so that every process ends up with the new, complete matrix.

MPI_Init() is called before solvePar(). We also have access to some helper functions, such as allocateMatrix(), deallocateMatrix(), and printMatrix().

void solvePar(int rows, int cols, int iterations, double td, double h, int sleep, double ** matrix) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double h_square = h * h;

    //split matrix and determine the part of the matrix to calculate
    int maxRow;
    int maxCol;
    int lowerLimRow;
    int upperLimRow;
    int lowerLimCol;
    int upperLimCol;
    if(rows > cols) {
        maxRow = rows/size;
        lowerLimCol = 0;
        upperLimCol = cols;
        lowerLimRow = rank * maxRow;
        upperLimRow = (rank + 1) * maxRow;
    }
    else {
        maxCol = cols/size;
        lowerLimRow = 0;
        upperLimRow = rows;
        lowerLimCol = rank * maxCol;
        upperLimCol = (rank + 1) * maxCol;
    }
    if(rank == (size - 1)) {
        upperLimRow = rows;
        upperLimCol = cols;
    }

    double ** newMatrix = allocateMatrix(rows, cols);

    for(int k = 0; k < iterations; k++) {
        newMatrix = fillWithZeros(rows, cols, newMatrix);
        for(int i = lowerLimRow; i < upperLimRow; i++) {
            for(int j = lowerLimCol; j < upperLimCol; j++) {
                double value = 0;
                if(i > 0 && j > 0 && i < (rows - 1) && j < (cols - 1)) {
                    value = calculateValue(i, j, td, h_square, matrix);
                }
                newMatrix[i][j] = value;    
                sleep_for(microseconds(sleep)); 
            } 
        }
        //printMatrix(rows, cols, newMatrix);

        //here, everything is still fine

        MPI_Barrier(MPI_COMM_WORLD);
        double ** recvMatrix = allocateMatrix(rows, cols);
        MPI_Allreduce(*newMatrix, *recvMatrix, rows*cols, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        matrix = recvMatrix;
        if(rank == 0) {
            printMatrix(rows, cols, recvMatrix);
        }
    }
    if(rank != 0) {
        deallocateMatrix(rows, matrix);
    }
    deallocateMatrix(rows, newMatrix);
} 

double calculateValue(int i, int j, double td, double h_square, double ** matrix) {
    double c, l, r, t, b;
    c = matrix[i][j];
    t = matrix[i - 1][j];
    b = matrix[i + 1][j];
    l = matrix[i][j - 1];
    r = matrix[i][j + 1];
    double value = c * (1.0 - 4.0 * td / h_square) + (t + b + l + r) * (td / h_square);
    return value;
}

double ** fillWithZeros(int rows, int cols, double ** matrix) {
     for(int row = 0; row < rows; row++) {
        for(int col = 0; col < cols; col++) {
            matrix[row][col] = 0;
        }
     }
     return matrix;
} 

When I run the code with 2 processes (and the parameters rows = cols = 9, iterations = 360, td = 0.00025, h = 0.1), I get a segmentation fault, and recvMatrix does not contain all the values from the other process.

recvMatrix after one iteration:

        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00
        0.00       48.30       83.05      103.90      110.85      103.90       83.05       48.30        0.00
        0.00       83.05      142.80      178.65        0.00        0.00        0.00        0.00        0.00
        0.00      103.90      178.65      223.50        0.00        0.00        0.00        0.00        0.00
        0.00      110.85      190.60      238.45        0.00        0.00        0.00        0.00        0.00
        0.00      103.90      178.65      223.50        0.00        0.00        0.00        0.00        0.00
        0.00       83.05      142.80      178.65        0.00        0.00        0.00        0.00        0.00
        0.00       48.30       83.05      103.90        0.00        0.00        0.00        0.00        0.00
        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00

Then it gives the segmentation fault:

[logti-a3326-11l:01398] *** Process received signal ***
[logti-a3326-11l:01398] Signal: Segmentation fault (11)
[logti-a3326-11l:01398] Signal code:  (128)
[logti-a3326-11l:01398] Failing at address: (nil)
[logti-a3326-11l:01398] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f2efbaf1890]
[logti-a3326-11l:01398] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x106)[0x7f2eea721b56]
[logti-a3326-11l:01398] [ 2] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_progress+0x12c)[0x7f2efb1daabc]
[logti-a3326-11l:01398] [ 3] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_request_default_wait_all+0x2e5)[0x7f2efc2e53f5]
[logti-a3326-11l:01398] [ 4] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_coll_base_allreduce_intra_recursivedoubling+0x497)[0x7f2efc335cb7]
[logti-a3326-11l:01398] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Allreduce+0x16a)[0x7f2efc2f547a]
[logti-a3326-11l:01398] [ 6] ./lab3(+0xc704)[0x559d98e4f704]
[logti-a3326-11l:01398] [ 7] ./lab3(+0x1049d)[0x559d98e5349d]
[logti-a3326-11l:01398] [ 8] ./lab3(+0xbe8e)[0x559d98e4ee8e]
[logti-a3326-11l:01398] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f2efb70fb97]
[logti-a3326-11l:01398] [10] ./lab3(+0xc00a)[0x559d98e4f00a]
[logti-a3326-11l:01398] *** End of error message ***

Edit:

double ** allocateMatrix(int rows, int cols) {
    double ** matrix = new double*[rows];

    for(int i = 0; i < rows; i++) {
        matrix[i] = new double[cols];
    }

    return matrix;
}

0 Answers