How do I fix "mpirun noticed that process rank 1 with PID 0 on node aliihsan exited on signal 11 (Segmentation fault)"?

Asked: 2019-12-25 19:55:52

Tags: c mpi openmpi

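Hi, I am trying to multiply n x n matrices using two processes with the Open MPI library. I want one process to do half of the multiplication and the other process to do the other half. My program uses 4x4 matrices; to illustrate the idea with a smaller 2x2 pair: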

9 8       1 2
3 5       3 4

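The first process would multiply the first row of the first matrix by the whole second matrix. The second process would multiply the second row of the first matrix by the whole second matrix.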
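The row split described above can be sketched in plain C, without any MPI calls. The helper names `multiply_rows` and `rows_for_rank` are hypothetical (they do not appear in the posted program), and the sketch assumes row-major contiguous storage and a matrix size divisible by the number of processes.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): computes rows
 * [row_begin, row_end) of c = a * b for n x n matrices stored
 * contiguously in row-major order. Rows outside the range are untouched. */
static void multiply_rows(const int *a, const int *b, int *c,
                          int n, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Block partition: which rows would the process with this rank own?
 * Assumption: n is divisible by nprocs, as in the two-process example. */
static void rows_for_rank(int n, int nprocs, int rank,
                          int *row_begin, int *row_end)
{
    assert(nprocs > 0 && n % nprocs == 0);
    int chunk = n / nprocs;
    *row_begin = rank * chunk;
    *row_end = *row_begin + chunk;
}
```

For the 2x2 matrices shown above, rank 0 would own row 0 and produce (33, 50), and rank 1 would own row 1 and produce (18, 26).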

Here is the code I wrote for this project:

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>
#define MATRIX_SIZE 4

void Transpose(int** mat)
{
    for (int i = 0; i < MATRIX_SIZE; i++)
    {
        int temp = 0;
        for (int j = i + 1; j < MATRIX_SIZE; j++)
        {
            temp = mat[i][j];
            mat[i][j] = mat[j][i];
            mat[j][i] = temp;
        }
    }
}
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    FILE *fp;
    fp = fopen("size_and_time_jik.txt", "a+");

    int rank, size; 
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of process */
    int i, j, k;
    int *p_matrix1[MATRIX_SIZE], *p_matrix2[MATRIX_SIZE], *p_result_matrix[MATRIX_SIZE], *p_result_matrix1[MATRIX_SIZE];
    srand(time(NULL));

    for (i = 0; i < MATRIX_SIZE; i++)
        p_matrix1[i] = (int *)malloc(MATRIX_SIZE * sizeof(int));
    for (int i = 0; i < MATRIX_SIZE; i++)
        p_matrix2[i] = (int *)malloc(MATRIX_SIZE * sizeof(int));
    for (int i = 0; i < MATRIX_SIZE; i++)
        p_result_matrix[i] = (int *)malloc(MATRIX_SIZE * sizeof(int));
    for (int i = 0; i < MATRIX_SIZE; i++)
        p_result_matrix1[i] = (int *)malloc(MATRIX_SIZE * sizeof(int));

    if (p_matrix1 == NULL || p_matrix2 == NULL || p_result_matrix == NULL || p_result_matrix1 == NULL)
    {
        printf("cannot allocate memory\n");
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < MATRIX_SIZE; i++)
        for (j = 0; j < MATRIX_SIZE; j++)
        {
            p_matrix1[i][j] = rand() % 10 + 1;
            p_matrix2[i][j] = rand() % 10 + 1;
            p_result_matrix[i][j] = 0;
            p_result_matrix1[i][j] = 0;
        }

    for (i = 0; i < MATRIX_SIZE; i++)
    {
        for (j = 0; j < MATRIX_SIZE; j++)
            printf("%d ", p_matrix1[i][j]);
        printf("\n");
    }
    printf("-------------------------------------------------\n");
    for (i = 0; i < MATRIX_SIZE; i++)
    {
        for (j = 0; j < MATRIX_SIZE; j++)
            printf("%d ", p_matrix2[i][j]);
        printf("\n");
    }

    double start_time = MPI_Wtime();
    Transpose(p_matrix2);
    if(rank==0)
    {
        for (j = 0; j < MATRIX_SIZE/size; j++)
            for (i = 0; i < MATRIX_SIZE; i++)
                for (k = 0; k < MATRIX_SIZE; k++)
                    p_result_matrix[i][j] += p_matrix1[i][k] * p_matrix2[j][k];
    }
    else if(rank==1)
    {
        for (j = MATRIX_SIZE/size; j < MATRIX_SIZE; j++)    
            for (i = 0; i < MATRIX_SIZE; i++)
                for (k = 0; k < MATRIX_SIZE; k++)
                    p_result_matrix[i][j] += p_matrix1[i][k] * p_matrix2[j][k];

    }

    if(rank==1)
        MPI_Send(p_result_matrix, MATRIX_SIZE*MATRIX_SIZE, MPI_INT, 0, 37, MPI_COMM_WORLD);
    else if(rank==0)    
        MPI_Recv(p_result_matrix1, MATRIX_SIZE*MATRIX_SIZE, MPI_INT, 1, 37, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // if(rank==0)
    // {
        for (i = 0; i < MATRIX_SIZE; i++)   
            for (j = 0; j < MATRIX_SIZE; j++)
                    p_result_matrix[i][j] = p_result_matrix[i][j] + p_result_matrix1[i][j];
        printf("-------------------------------------------------\n");
        for (i = 0; i < MATRIX_SIZE; i++)
        {
            for (j = 0; j < MATRIX_SIZE; j++)
                printf("%d ", p_result_matrix[i][j]);
            printf("\n");
        }
    // }
    double stop_time = MPI_Wtime();

    double elapsed_time = stop_time - start_time;
    printf("Elapsed time = %f\n", elapsed_time);
    fprintf(fp, "%5d ---> %f\n", MATRIX_SIZE, elapsed_time);
    fclose(fp);
    getchar();
    MPI_Finalize();
    return 0;
}

Here is the output:

9 8 9 5 
3 5 1 5 
10 8 1 6 
7 6 4 6 
-------------------------------------------------
3 1 5 3 
1 8 4 9 
9 8 9 5 
3 5 1 5 
10 8 1 6 
7 6 4 6 
-------------------------------------------------
3 1 5 3 
1 8 4 9 
4 10 7 4 
7 4 7 5 
4 10 7 4 
7 4 7 5 
-------------------------------------------------
0 0 175 160 
0 0 77 83 
0 0 131 136 
0 0 129 121 
Elapsed time = 0.000032
[aliihsan:08600] *** Process received signal ***
[aliihsan:08600] Signal: Segmentation fault (11)
[aliihsan:08600] Signal code: Address not mapped (1)
[aliihsan:08600] Failing at address: 0x563e75488fb0
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node aliihsan exited on signal 11 (Segmentation fault).

How can I fix the segmentation fault? Thanks in advance for your help.
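One likely cause, judging only from the posted code: `p_result_matrix` and `p_result_matrix1` are arrays of four `int *` row pointers, each row allocated by a separate `malloc`. Passing such an array to `MPI_Send` or `MPI_Recv` with a count of `MATRIX_SIZE*MATRIX_SIZE` and type `MPI_INT` makes MPI treat the pointer array itself as 16 integers. On rank 0 the receive then overwrites the row pointers (and memory past the end of the array) with bytes from rank 1, so the later `p_result_matrix1[i][j]` access dereferences an invalid address. A minimal sketch of a contiguous layout that avoids this follows; `alloc_matrix` and `free_matrix` are hypothetical helpers, not part of the original program, and this is an untested suggestion for the posted setup rather than a confirmed fix.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates an n x n int matrix as ONE contiguous,
 * zero-initialized block, plus an array of row pointers into it, so that
 * m[i][j] indexing still works. Returns NULL on allocation failure. */
static int **alloc_matrix(int n)
{
    assert(n > 0);
    int **rows = malloc((size_t)n * sizeof *rows);
    int *data = calloc((size_t)n * (size_t)n, sizeof *data);
    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }
    for (int i = 0; i < n; i++)
        rows[i] = data + (size_t)i * (size_t)n;
    return rows;
}

/* Releases a matrix created by alloc_matrix. */
static void free_matrix(int **m)
{
    if (m != NULL) {
        free(m[0]);  /* the contiguous data block */
        free(m);     /* the row-pointer array */
    }
}

/* With this layout, &m[0][0] addresses n*n consecutive ints, so a call
 * such as
 *     MPI_Send(&m[0][0], n * n, MPI_INT, 0, 37, MPI_COMM_WORLD);
 * would transmit matrix elements rather than row pointers. */
```

Under the same assumption, the matching receive on rank 0 would also take `&m[0][0]` of its own contiguous matrix as the buffer. Two other details in the posted code may be worth checking: the `NULL` test compares the arrays themselves rather than the pointers returned by `malloc`, and the summation and printing run on both ranks because the `if(rank==0)` guard is commented out.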

0 answers:

No answers yet