Update
I have a university project in which we are supposed to build a cluster out of RPis. By now we have a fully functional BLCR/MPICH setup. BLCR works very well with normal processes linked against the library. The demo we have to show from the management web interface is:

We are allowed to use the simplest possible computations. The first point we got working easily, and with MPI as well. For the second point we actually only use normal processes (no MPI). As for the third point, I don't really understand how to implement a master/slave MPI scheme in which I can restart a slave process. This also affects the second point, because we should/could/have to checkpoint a slave process, kill/stop it, and restart it on another node. I know I have to handle MPI errors myself, but how do I recover the process? It would be great if somebody could post links or papers (with explanations) for me.

Thanks in advance
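Regarding recovering from a failed slave: a first step I have read about (a sketch, not something from our tested setup) is to replace the default error handler on `MPI_COMM_WORLD`, which is `MPI_ERRORS_ARE_FATAL`, with `MPI_ERRORS_RETURN`, so that a communication involving a dead rank returns an error code instead of aborting the whole job. The master can then react, e.g. mark the work unit as unfinished. Note that plain MPICH cannot respawn a rank back into an existing communicator, so true recovery still needs a whole-job restart from a checkpoint (which is what BLCR is for); the snippet below only shows the error-detection side.

```c
/* Sketch: make MPI calls return errors instead of aborting the job.
 * Requires an MPI installation; compile with mpicc. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Default handler MPI_ERRORS_ARE_FATAL kills every rank on failure;
     * MPI_ERRORS_RETURN hands the error code back to the caller instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        int answer = 0;
        MPI_Status st;
        /* Master: a receive from a killed slave now returns an error
         * code that we can inspect instead of dying. */
        err = MPI_Recv(&answer, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &st);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "slave failed: %s\n", msg);
            /* React here: requeue the work unit, trigger a restart
             * from the last BLCR checkpoint, etc. */
        }
    } else if (rank == 1) {
        int answer = 42;
        MPI_Send(&answer, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```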
Update: As mentioned before, our BLCR + MPICH setup works, or at least seems to. But... when I start an MPI process, checkpointing appears to run fine.
Here is the proof:
... snip ...
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... DONE
Many salts: 767744 c/s real, 767744 c/s virtual
Only one salt: 560896 c/s real, 560896 c/s virtual
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... [proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] checkpoint completed
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] checkpoint completed
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] checkpoint completed
... snip ...
If I kill a slave process on any node, I get this:
... snip ...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
... snip ...
That should be fine, because we have a checkpoint, so we can restart our application. But it does not work:
pi 7380 0.0 0.2 2984 1012 pts/4 S+ 16:38 0:00 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
pi 7381 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.101 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 7382 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.102 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
pi 7383 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.105 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
pi 7438 0.0 0.1 3548 868 pts/1 S+ 16:40 0:00 grep --color=auto mpi
I don't know why, but the first time I restart the application on each node, the process actually seems to restart (I checked with top and ps aux | grep "john"), yet there is no output on the management console/terminal. It just hangs after printing:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
My plan B, in case the BLCR/MPICH stack does actually work, is simply to test with an application of my own. Maybe the trouble is with John.
Thanks in advance
Update: The next problem occurs with a simple Hello World. I'm slowly getting desperate. Maybe I'm just too confused.
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 -f /tmp/machinefile -n 4 ./hello
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
[proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] checkpoint completed
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] checkpoint completed
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] checkpoint completed
[proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:1@node1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:1@node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@node1] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:2@node3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:2@node3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@node3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
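The pattern above — the first round of checkpoints completes on every proxy, then the second round fails with "Previous checkpoint has not completed" — suggests (this is an assumption on my part, not a verified diagnosis) that `-ckpoint-interval 3` is shorter than the time BLCR needs to write the images to /tmp on the Pi's SD card, so the next checkpoint is requested before the previous one has finished. One thing to try is a much larger interval:

```
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 60 \
        -f /tmp/machinefile -n 4 ./hello
```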
hello.c:
/* C Example */
#include <stdio.h>
#include <unistd.h>   /* gethostname, getpid */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, j;
    char hostname[1024];

    hostname[1023] = '\0';
    gethostname(hostname, 1023);

    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */

    /* busy loops, just to keep the process running long enough
     * for the checkpointer to kick in */
    for (i = 0; i < 400000000; i++) {
        for (j = 0; j < 4000000; j++) {
        }
    }

    printf("%s done...", hostname);
    printf("%s: %d is alive\n", hostname, (int) getpid());

    MPI_Finalize();
    return 0;
}
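For completeness, this is how I build and launch hello.c (the mpicc path is my assumption, based on where hydra_pmi_proxy lives in the output above):

```
/usr/local/bin/mpich/bin/mpicc hello.c -o hello
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 \
        -f /tmp/machinefile -n 4 ./hello
```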