I'm trying to run a Scatter/Gather example with Python and I'm running into problems. To make sure my cluster is working at all, I first tried a hello world:
$ cat /var/nfs/helloworld.py
#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write( "Hello, World! I am process %d of %d on %s.\n" % (rank, size, name))
Here is my machine file:
$ cat /var/nfs/machinefile
node1:8
node2:8
desktop01:8
And here is the output of lscpu -p on the three nodes:
$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0
It runs as expected with the following command:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile python /var/nfs/helloworld.py
Hello, World! I am process 22 of 24 on desktop01.
Hello, World! I am process 19 of 24 on desktop01.
Hello, World! I am process 20 of 24 on desktop01.
Hello, World! I am process 18 of 24 on desktop01.
Hello, World! I am process 23 of 24 on desktop01.
Hello, World! I am process 16 of 24 on desktop01.
Hello, World! I am process 21 of 24 on desktop01.
Hello, World! I am process 17 of 24 on desktop01.
Hello, World! I am process 5 of 24 on node1.
Hello, World! I am process 0 of 24 on node1.
Hello, World! I am process 3 of 24 on node1.
Hello, World! I am process 4 of 24 on node1.
Hello, World! I am process 15 of 24 on node2.
Hello, World! I am process 13 of 24 on node2.
Hello, World! I am process 11 of 24 on node2.
Hello, World! I am process 8 of 24 on node2.
Hello, World! I am process 6 of 24 on node1.
Hello, World! I am process 1 of 24 on node1.
Hello, World! I am process 10 of 24 on node2.
Hello, World! I am process 12 of 24 on node2.
Hello, World! I am process 14 of 24 on node2.
Hello, World! I am process 9 of 24 on node2.
Hello, World! I am process 7 of 24 on node1.
Hello, World! I am process 2 of 24 on node1.
$
So I assume my cluster is working.
Now I tried one of the demos that comes with mpi4py (2.0.0). I'm using Python 3, and all nodes are Linux running MPICH2 (3.1.2):
$ cat /var/nfs/3.py
#!/usr/bin/env python3
from __future__ import division

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

print("-"*78)
print(" Running on %d cores" % comm.size)
print("-"*78)

my_N = 4
N = my_N * comm.size

if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)
else:
    A = np.empty(N, dtype=np.float64)

my_A = np.empty(my_N, dtype=np.float64)

# Scatter data into my_A arrays
comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])

print("After Scatter:")
for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, my_A))
    comm.Barrier()

# Everybody is multiplying by 2
my_A *= 2

# Allgather data into A again
comm.Allgather([my_A, MPI.DOUBLE], [A, MPI.DOUBLE])

print("After Allgather:")
for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, A))
    comm.Barrier()
Trying to run it, it fails:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
*** stack smashing detected ***: python3 terminated
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6022 RUNNING AT node1
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@desktop01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@desktop01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@desktop01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@desktop01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@desktop01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@desktop01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@desktop01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
However, asking it to run everything on a single node works fine:
$ mpiexec.hydra -np 8 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
After Allgather:
[0] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[1] [ 4. 5. 6. 7.]
After Allgather:
[1] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[2] [ 8. 9. 10. 11.]
After Allgather:
[2] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[3] [ 12. 13. 14. 15.]
After Allgather:
[3] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[4] [ 16. 17. 18. 19.]
After Allgather:
[4] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[5] [ 20. 21. 22. 23.]
After Allgather:
[5] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[6] [ 24. 25. 26. 27.]
After Allgather:
[6] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[7] [ 28. 29. 30. 31.]
After Allgather:
[7] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
$
What am I doing wrong? Is there a mismatch in the amount of data I'm supposed to pass to the Scatter/Gather methods? Is this a known issue with MPICH2? I'm fairly sure this used to work with OpenMPI + Python 2, but I can't test that right now.
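To make the "mismatch" idea concrete, this is the kind of size check I have in mind, a minimal sketch based on the same demo (just for illustration, I have not actually run this across the cluster yet):

#!/usr/bin/env python3
# Sketch of a sanity check: every rank reports the exact buffer sizes it
# passes to Scatter, so a size mismatch between ranks would show up in
# the output before the collective is attempted.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_N = 4
N = my_N * comm.size

if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)
else:
    A = np.empty(N, dtype=np.float64)
my_A = np.empty(my_N, dtype=np.float64)

# Report send/recv buffer sizes per rank before the collective runs.
print("[%d/%d on %s] A: %d bytes, my_A: %d bytes"
      % (comm.rank, comm.size, MPI.Get_processor_name(), A.nbytes, my_A.nbytes))

comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])
print("[%d] received %s" % (comm.rank, my_A))

If every rank prints the same sizes and the Scatter still fails only when the job spans multiple nodes, I would take that as a hint that the problem is not the buffer sizes.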