I'm trying to run a Scatter/Gather example with Python and I'm running into problems. To make sure my cluster is working at all, I first tried a hello world:
$ cat /var/nfs/helloworld.py
#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write( "Hello, World! I am process %d of %d on %s.\n" % (rank, size, name))
Here is my machine file:
$ cat /var/nfs/machinefile
node1:8
node2:8
desktop01:8
And here is the output of lscpu -p on the three nodes:
$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0
It runs as expected with the following command:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile python /var/nfs/helloworld.py
Hello, World! I am process 22 of 24 on desktop01.
Hello, World! I am process 19 of 24 on desktop01.
Hello, World! I am process 20 of 24 on desktop01.
Hello, World! I am process 18 of 24 on desktop01.
Hello, World! I am process 23 of 24 on desktop01.
Hello, World! I am process 16 of 24 on desktop01.
Hello, World! I am process 21 of 24 on desktop01.
Hello, World! I am process 17 of 24 on desktop01.
Hello, World! I am process 5 of 24 on node1.
Hello, World! I am process 0 of 24 on node1.
Hello, World! I am process 3 of 24 on node1.
Hello, World! I am process 4 of 24 on node1.
Hello, World! I am process 15 of 24 on node2.
Hello, World! I am process 13 of 24 on node2.
Hello, World! I am process 11 of 24 on node2.
Hello, World! I am process 8 of 24 on node2.
Hello, World! I am process 6 of 24 on node1.
Hello, World! I am process 1 of 24 on node1.
Hello, World! I am process 10 of 24 on node2.
Hello, World! I am process 12 of 24 on node2.
Hello, World! I am process 14 of 24 on node2.
Hello, World! I am process 9 of 24 on node2.
Hello, World! I am process 7 of 24 on node1.
Hello, World! I am process 2 of 24 on node1.
$
So I assume my cluster is working.
Now I tried one of the demos that comes with mpi4py (2.0.0). I'm using Python 3, and all nodes are Linux running MPICH2 (3.1.2):
$ cat /var/nfs/3.py
#!/usr/bin/env python3
from __future__ import division

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

print("-"*78)
print(" Running on %d cores" % comm.size)
print("-"*78)

my_N = 4
N = my_N * comm.size

if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)
else:
    A = np.empty(N, dtype=np.float64)

my_A = np.empty(my_N, dtype=np.float64)

# Scatter data into my_A arrays
comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])

print("After Scatter:")
for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, my_A))
    comm.Barrier()

# Everybody is multiplying by 2
my_A *= 2

# Allgather data into A again
comm.Allgather([my_A, MPI.DOUBLE], [A, MPI.DOUBLE])

print("After Allgather:")
for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, A))
    comm.Barrier()
Trying to run it, it fails:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
*** stack smashing detected ***: python3 terminated
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6022 RUNNING AT node1
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@desktop01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@desktop01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@desktop01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@desktop01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@desktop01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@desktop01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@desktop01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
However, asking it to run everything on a single node works fine:
$ mpiexec.hydra -np 8 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
After Allgather:
[0] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[1] [ 4. 5. 6. 7.]
After Allgather:
[1] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[2] [ 8. 9. 10. 11.]
After Allgather:
[2] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[3] [ 12. 13. 14. 15.]
After Allgather:
[3] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[4] [ 16. 17. 18. 19.]
After Allgather:
[4] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[5] [ 20. 21. 22. 23.]
After Allgather:
[5] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[6] [ 24. 25. 26. 27.]
After Allgather:
[6] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[7] [ 28. 29. 30. 31.]
After Allgather:
[7] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
$
What am I doing wrong? Is there a mismatch in the amount of data I'm supposed to pass to the Scatter/Gather methods? Is this a known issue with MPICH2? I'm fairly sure this used to work with OpenMPI + Python 2, but I can't test that right now.
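To make the "mismatch" idea concrete, this is the kind of size check I have in mind, a minimal sketch based on the same demo (just for illustration, I have not actually run this across the cluster yet):

#!/usr/bin/env python3
# Sketch of a sanity check: every rank reports the exact buffer sizes it
# passes to Scatter, so a size mismatch between ranks would show up in
# the output before the collective is attempted.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_N = 4
N = my_N * comm.size

if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)
else:
    A = np.empty(N, dtype=np.float64)
my_A = np.empty(my_N, dtype=np.float64)

# Report send/recv buffer sizes per rank before the collective runs.
print("[%d/%d on %s] A: %d bytes, my_A: %d bytes"
      % (comm.rank, comm.size, MPI.Get_processor_name(), A.nbytes, my_A.nbytes))

comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])
print("[%d] received %s" % (comm.rank, my_A))

If every rank prints the same sizes and the Scatter still fails only when the job spans multiple nodes, I would take that as a hint that the problem is not the buffer sizes.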