当我尝试运行一个应用程序时(只是一个简单的hello_world.c不起作用)我每次都会收到此错误:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] checkpoint completed
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
我只想制作一个检查点,而不是其他任何东西(稍后重启)。
提前致谢
我尝试过MPICH2,没有机会。或者也许我错了...
pi@raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec@raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
Test3的码:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char* argv[]) {
int rank;
int size;
int i = 0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Status status;
if (rank == 0) {
for(i; i <=100; i++){
int j = 0;
while(j < 100000000){
j++;
}
printf("Count to: %i\n", i);
}
} else {
}
MPI_Finalize();
return 0;
}
我只需要一个成功的检查点并显示重启。 如果有人有一个有效的例子(无关紧要,简单的工作“Hello World”会让我开心!)我会很高兴。
新年快乐!答案 0 :(得分:1)
不幸的是,MPICH 3.0.4中的检查点/重启代码目前已知有问题。希望在将来的版本中得到修复。看起来你可能正确使用它。如果你回到以前的版本,你可能会有更好的运气。
答案 1 :(得分:0)
这里的问题是检查点的间隔太小。 将其设置为20秒或更长时间解决了这个问题(但不是其他:()问题。