Restarting MPI slaves from a checkpoint after a failure on ARMv6

Date: 2013-12-12 00:30:10

Tags: arm mpi raspberry-pi master-slave blcr

Update

I have a university project in which we are supposed to build a cluster out of Raspberry Pis. We now have a fully functional BLCR/MPICH setup, and BLCR works very well with normal processes linked against the library. The demo we have to present from a management web interface consists of:

  1. Running jobs in parallel
  2. Migrating processes across nodes
  3. MPI fault tolerance

We are allowed to use the simplest of computations. The first point was easy to get working, also with MPI. For the second point we actually only use normal processes (no MPI). Regarding the third point, I have little idea how to implement a master/slave MPI scheme in which I can restart a slave process. This also affects the second point, because we should/can/have to checkpoint a slave process, kill/stop it, and restart it on another node. I know that I have to handle MPI errors myself, but how do I recover the process? It would be great if someone could post me links or papers (with explanations).

Thanks in advance

Update: As mentioned above, our BLCR + MPICH setup works, or at least seems to. Checkpointing appears to work fine when I start MPI processes.

Here is the proof:

    ... snip ...
    Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... DONE
    Many salts: 767744 c/s real, 767744 c/s virtual
    Only one salt:  560896 c/s real, 560896 c/s virtual
    
    Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... [proxy:0:0@node2] requesting checkpoint
    [proxy:0:0@node2] checkpoint completed
    [proxy:0:1@node1] requesting checkpoint
    [proxy:0:1@node1] checkpoint completed
    [proxy:0:2@node3] requesting checkpoint
    [proxy:0:2@node3] checkpoint completed
    ... snip ...
    

If I kill a slave process on any node, I get this:

    ... snip ...
    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   EXIT CODE: 9
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
    ... snip ...
    

That is okay, because we have a checkpoint, so we can restart our application. But it does not work:

    pi        7380  0.0  0.2   2984  1012 pts/4    S+   16:38   0:00 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
    pi        7381  0.1  0.5   5712  2464 ?        Ss   16:38   0:00 /usr/bin/ssh -x 192.168.42.101 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
    pi        7382  0.1  0.5   5712  2464 ?        Ss   16:38   0:00 /usr/bin/ssh -x 192.168.42.102 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
    pi        7383  0.1  0.5   5712  2464 ?        Ss   16:38   0:00 /usr/bin/ssh -x 192.168.42.105 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
    pi        7438  0.0  0.1   3548   868 pts/1    S+   16:40   0:00 grep --color=auto mpi
    

I don't know why, but the first time the application is restarted on each node, the process does seem to start again (I checked with `top` and `ps aux | grep "john"`), yet there is no output to the management console/terminal. It just hangs after printing:

    mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
    Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
    

My plan B is simply to test with our own application whether the BLCR/MPICH stuff actually works. Maybe there is some trouble with John.

Thanks in advance

**Update**

The next problem shows up with a simple hello world. I am slowly getting desperate; maybe I am just too confused.

    mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 -f /tmp/machinefile -n 4 ./hello
    Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
    [proxy:0:0@node2] requesting checkpoint
    [proxy:0:0@node2] checkpoint completed
    [proxy:0:1@node1] requesting checkpoint
    [proxy:0:1@node1] checkpoint completed
    [proxy:0:2@node3] requesting checkpoint
    [proxy:0:2@node3] checkpoint completed
    [proxy:0:0@node2] requesting checkpoint
    [proxy:0:0@node2] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
    [proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [proxy:0:1@node1] requesting checkpoint
    [proxy:0:1@node1] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:1@node1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
    [proxy:0:1@node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:1@node1] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [proxy:0:2@node3] requesting checkpoint
    [proxy:0:2@node3] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:2@node3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
    [proxy:0:2@node3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:2@node3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
    [mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
    [mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
    

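A guess at the `Previous checkpoint has not completed` errors above: with `-ckpoint-interval 3`, Hydra requests a new checkpoint every 3 seconds, but writing a BLCR context file to /tmp on a Raspberry Pi can easily take longer than that, so the requests pile up. A much longer interval (the 60 below is an arbitrary choice, not a documented value) would at least rule that out:

```shell
# Same invocation as above, but leave enough time between checkpoints
# for each BLCR context file to be written out completely.
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 60 \
        -f /tmp/machinefile -n 4 ./hello
```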
hello.c:

    /* C Example */
    #include <stdio.h>
    #include <unistd.h>   /* gethostname(), getpid() */
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      int rank, size, i, j;
      char hostname[1024];
      hostname[1023] = '\0';
      gethostname(hostname, 1023);

      MPI_Init (&argc, &argv);                /* starts MPI */
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
      MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */

      /* busy loop: burn CPU time so there is something to checkpoint */
      for (i = 0; i < 400000000; i++) {
        for (j = 0; j < 4000000; j++) {
        }
      }

      printf("%s done...\n", hostname);
      printf("%s: %d is alive\n", hostname, (int) getpid());
      MPI_Finalize();
      return 0;
    }
    

0 Answers:

No answers