I am writing a program for parallel matrix multiplication and I am using MPI_Isend for non-blocking communication. The main parallelization logic is: the master starts non-blocking sends of the work to the workers, does its own share of the multiplication, calls MPI_Waitall, and then calls MPI_Recv to collect the workers' results. However, in practice the workers only seem to receive their messages once the master calls MPI_Waitall; so while the master is computing its own share, the other nodes sit idle. Why does this happen?
#include "helpers.h"
#include <mpi.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi, malloc */
int main(int argc, char const *argv[]) {
    int m = atoi(argv[1]);
    int n = atoi(argv[2]);
    int p = atoi(argv[3]);
    double start, elapsed;

    MPI_Init(NULL, NULL);

    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    float *A = malloc(m * n * sizeof(float));
    float *B = malloc(n * p * sizeof(float));
    float *C = malloc(m * p * sizeof(float));
    float *C_serial = malloc(m * p * sizeof(float));

    if (world_rank == 0) {
        randarr(m * n, A);
    } else if (world_rank == 1) {
        randarr(n * p, B);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    // Serial multiplication (timing and correctness reference)
    if (!world_rank) {
        Multiply_serial(A, B, C_serial, m, n, p);
        elapsed = MPI_Wtime() - start;
        printf("[*] Serial multiplication: %f seconds\n", elapsed);
    }

    // Parallel multiplication (assumes m is divisible by world_size)
    int mpart = m / world_size;
    MPI_Request request_ids[(world_size - 1) * 2];
    float *Apart = malloc(mpart * n * sizeof(float));
    float *Cpart = malloc(mpart * p * sizeof(float));

    // Master node
    if (!world_rank) {
        for (int i = 1; i < world_size; ++i) {
            MPI_Isend(&A[i*mpart*n], mpart*n, MPI_FLOAT, i, 1, MPI_COMM_WORLD, &request_ids[(i-1)*2]);
            MPI_Isend(B, n*p, MPI_FLOAT, i, 2, MPI_COMM_WORLD, &request_ids[(i-1)*2 + 1]);
        }
        printf("[*] Started sending: %f seconds\n", MPI_Wtime() - start);

        // Master node's share of the multiplication
        for (int i = 0; i < mpart; ++i) {
            for (int j = 0; j < p; ++j) {
                C[i*p + j] = 0.0;
                for (int k = 0; k < n; ++k) {
                    C[i*p + j] += A[i*n + k] * B[k*p + j];
                }
            }
        }
        printf("[*] Multiplied: %f seconds\n", MPI_Wtime() - start);

        MPI_Waitall((world_size - 1) * 2, request_ids, MPI_STATUSES_IGNORE);
        printf("[*] Sending completed: %f seconds\n", MPI_Wtime() - start);

        for (int i = 1; i < world_size; ++i) {
            MPI_Recv(&C[i*mpart*p], mpart*p, MPI_FLOAT, i, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        elapsed = MPI_Wtime() - start;
        printf("[*] Parallel multiplication: %f seconds\n", elapsed);

        int correct = IsEqual(C, C_serial, m, p);
        printf("[*] Parallel correctness: %d\n", correct);
    }

    // Worker nodes
    if (world_rank) {
        MPI_Recv(Apart, mpart*n, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(B, n*p, MPI_FLOAT, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("[*] I'm worker, got work: %f seconds\n", MPI_Wtime() - start);

        for (int i = 0; i < mpart; ++i) {
            for (int j = 0; j < p; ++j) {
                Cpart[i*p + j] = 0.0;
                for (int k = 0; k < n; ++k) {
                    Cpart[i*p + j] += Apart[i*n + k] * B[k*p + j];
                }
            }
        }
        MPI_Send(Cpart, mpart*p, MPI_FLOAT, 0, 3, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Here is a sample run of the program:
sumit@HAL9000:~/Coding/parallel/parallel-distributed-assignments/A2$ time make mpinb
mpicc mpib.c -o mpib
mpicc mpinb.c -o mpinb
mpirun -n 4 ./mpinb 6000 32 6000
Hello world from processor HAL9000, rank 0 out of 4 processors
Hello world from processor HAL9000, rank 1 out of 4 processors
Hello world from processor HAL9000, rank 2 out of 4 processors
Hello world from processor HAL9000, rank 3 out of 4 processors
[*] Serial multiplication: 4.700643 seconds
[*] Started sending: 4.700674 seconds
[*] Multiplied: 5.790390 seconds
[*] I'm worker, got work: 5.790726 seconds
[*] I'm worker, got work: 5.790901 seconds
[*] Sending completed: 5.790974 seconds
[*] I'm worker, got work: 5.791019 seconds
[*] Parallel multiplication: 6.920345 seconds
[*] Parallel correctness: 1
real 0m7.175s
user 0m27.699s
sys 0m0.479s
Answer (score: 1)
Have you run it several times, and does it behave the same way each time?
Try telling the "rank = 0" process to make a noticeable pause (maybe 10 seconds?) before the MPI_Waitall call, because I am not sure this behaves the way you think it does. I suspect it may simply be that the communication takes longer than the three for-loops run, so the workers finish their work after the "rank = 0" process does.
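The pause experiment described above could look like this on rank 0, just before the MPI_Waitall call (a sketch against the question's code; the 10-second figure is arbitrary, and `sleep` needs `<unistd.h>`):

```c
/* On rank 0, just before MPI_Waitall: pause so you can watch whether the
 * workers' "got work" lines appear during the sleep or only after the
 * wait completes. */
printf("[*] Pausing 10 s before MPI_Waitall...\n");
fflush(stdout);
sleep(10);
MPI_Waitall((world_size - 1) * 2, request_ids, MPI_STATUSES_IGNORE);
```

If the "got work" lines show up during the pause, the sends completed on their own; if they only show up afterwards, the library is making communication progress only inside MPI calls.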
P.S.: I also think two of the printf messages are not the same in your code and in your run (the code's "Just sent" and "Sending completed" seem to appear as "Started sending" and "Sending completed" in the output?).
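For completeness, one commonly suggested way to let the MPI library make progress on the outstanding MPI_Isends while rank 0 computes is to poke it periodically with MPI_Testall inside the compute loop. This is a sketch against the question's code, not a guaranteed fix; whether it helps depends on the MPI implementation's progress engine:

```c
/* Rank 0's share of the multiplication, with a periodic MPI_Testall so the
 * MPI library gets a chance to progress the pending non-blocking sends. */
int sends_done = 0;
for (int i = 0; i < mpart; ++i) {
    for (int j = 0; j < p; ++j) {
        C[i*p + j] = 0.0;
        for (int k = 0; k < n; ++k)
            C[i*p + j] += A[i*n + k] * B[k*p + j];
    }
    if (!sends_done)  /* stops polling once all sends have completed */
        MPI_Testall((world_size - 1) * 2, request_ids, &sends_done,
                    MPI_STATUSES_IGNORE);
}
MPI_Waitall((world_size - 1) * 2, request_ids, MPI_STATUSES_IGNORE);
```

The idea is that many MPI implementations only advance large non-blocking transfers inside MPI calls, so a long call-free compute loop can stall them until MPI_Waitall.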