MPI apparently deadlocks when using more than 5 workers

Asked: 2014-09-25 17:55:25

Tags: python numpy scipy mpi

I'm writing a Python script that uses MPI to send unsorted arrays to workers; the workers sort the arrays and return them to the master.

Running it with anything from mpirun -n 2 python mpi_sort.py up to mpirun -n 5 python mpi_sort.py works fine, except that when the number of arrays gets too large the workers never stop; the DIE messages seem to get lost.

With more than 5 workers, the script stops very early in its execution. Typically the workers get their first batch of arrays, return them to the master, and never get any more work. I'm stumped as to why this happens.

Stranger still, if I reduce the size or number of the arrays, the higher worker counts seem to do just fine.

The code follows:

#!/usr/bin/env python
import numpy
from mpi4py import MPI

NUMARRAYS = 1000
ARRAYSIZE = 10000

ASK_FOR_WORK_TAG = 1
WORK_TAG = 2
DIE_TAG = 3

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
status = MPI.Status()

# Master
if rank == 0:
    data = numpy.empty(ARRAYSIZE, dtype=numpy.int32)
    sorted_data = numpy.empty([NUMARRAYS, ARRAYSIZE], dtype=numpy.int32)
    sorted_arrays = 0

    while sorted_arrays < NUMARRAYS:
        print "[Master] Probing"
        comm.Recv(data, source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        print "[Master] Probed"

        dest = status.Get_source()
        print "[Master] got request for work from worker %d" % dest

        data = numpy.random.random_integers(0, ARRAYSIZE, ARRAYSIZE).astype(numpy.int32)
        print "[Master] sending work to Worker %d" % dest
        comm.Send([data, ARRAYSIZE, MPI.INT], dest=dest, tag=WORK_TAG)
        print "[Master] sent work to Worker %d" % dest

        print "[Master] waiting for complete work from someone"
        comm.Recv([data, ARRAYSIZE, MPI.INT], source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        print "[Master] got results from Worker %d. Storing in line %d" % (status.Get_source(), sorted_arrays)
        sorted_data[sorted_arrays] = numpy.copy(data)
        numpy.savetxt("sample", data, newline=" ", fmt="%d")
        sorted_arrays += 1

    for dest in range(1, size):
        print "[Master] Telling Worker %d to DIE DIE DIE" % dest
        comm.Send(data, dest=dest, tag=DIE_TAG)

# Slave
else:
    # Ask for work
    data = numpy.empty(ARRAYSIZE, dtype=numpy.int32)
    while True:
        print "[Worker %d] asking for work" % rank
        comm.Send(data, dest=0, tag=ASK_FOR_WORK_TAG)
        print "[Worker %d] sent request for work" % rank

        comm.Recv(data, source=0, tag=MPI.ANY_TAG, status=status)

        if status.Get_tag() == WORK_TAG:
            print "[Worker %d] got work" % rank

            print "[Worker %d] is sorting the array" % rank
            data.sort()
            print "[Worker %d] finished work. Sending it back" % rank
            comm.Send([data, ARRAYSIZE, MPI.INT], dest=0, tag=ASK_FOR_WORK_TAG)
        else:
            print "[Worker %d] DIE DIE DIE" % rank
            break

1 Answer:

Answer (score: 0):

I found the problem.

There were a couple of deadlocks, as @mgilson suggested.

First, a worker would send its finished work back, but the master would interpret it as a request for more work and reply with a message the worker wasn't expecting.

Then there was a similar problem with the kill messages: the DIE messages were sent to workers that weren't expecting them.
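In other words, each message type needs its own tag, and the master has to dispatch on status.Get_tag() instead of assuming that whatever arrives next is a work request. As a rough sketch of the handshake (a fragment distilled from the full code below, using the variables and tags defined there):

# Master: receive from anyone, then branch on the tag of what actually arrived
comm.Recv([data, ARRAYSIZE, MPI.INT], source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
if status.Get_tag() == ASK_FOR_WORK_TAG:
    # answer with either WORK_TAG (a new array) or DIE_TAG (shut down)
    comm.Send([data, ARRAYSIZE, MPI.INT], dest=status.Get_source(), tag=WORK_TAG)
elif status.Get_tag() == WORK_DONE_TAG:
    sorted_data[sorted_arrays] = numpy.copy(data)

# Worker: label each outgoing message with what it really is
comm.Send([data, ARRAYSIZE, MPI.INT], dest=0, tag=ASK_FOR_WORK_TAG)  # "give me work"
comm.Send([data, ARRAYSIZE, MPI.INT], dest=0, tag=WORK_DONE_TAG)     # "here is a result"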

The final solution was:

#!/usr/bin/env python
import numpy
from mpi4py import MPI

NUMARRAYS = 100
ARRAYSIZE = 10000

ASK_FOR_WORK_TAG = 1
WORK_TAG = 2
WORK_DONE_TAG = 3
DIE_TAG = 4

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
status = MPI.Status()

# Master
if rank == 0:
    data = numpy.empty(ARRAYSIZE, dtype=numpy.int32)
    sorted_data = numpy.empty([NUMARRAYS, ARRAYSIZE], dtype=numpy.int32)
    sorted_arrays = 0
    dead_workers = 0

    while dead_workers < size - 1:
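        # Loop until every worker (size - 1 of them) has been told to die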
        print "[Master] Probing"
        comm.Recv([data, ARRAYSIZE, MPI.INT], source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        print "[Master] Probed"

        dest = status.Get_source()
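    # A worker sends either a request for work or a finished result; branch on the tag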
        if status.Get_tag() == ASK_FOR_WORK_TAG:
            if sorted_arrays <= NUMARRAYS - 1:
                print "[Master] got request for work from worker %d" % dest

                data = numpy.random.random_integers(0, ARRAYSIZE, ARRAYSIZE).astype(numpy.int32)
                print "[Master] sending work to Worker %d" % dest
                comm.Send([data, ARRAYSIZE, MPI.INT], dest=dest, tag=WORK_TAG)
                print "[Master] sent work to Worker %d" % dest
            else:
                # Someone did more work than they should have
                print "[Master] Telling worker %d to DIE DIE DIE" % dest
                comm.Send([data, ARRAYSIZE, MPI.INT], dest=dest, tag=DIE_TAG)
                dead_workers += 1
                print "[Master] Already killed %d workers" % dead_workers

        elif status.Get_tag() == WORK_DONE_TAG:
            if sorted_arrays <= NUMARRAYS - 1:
                print "[Master] got results from Worker %d. Storing in line %d" % (status.Get_source(), sorted_arrays)
                sorted_data[sorted_arrays] = numpy.copy(data)
                numpy.savetxt("sample", data, newline=" ", fmt="%d")
                sorted_arrays += 1

# Slave
else:
    # Ask for work
    data = numpy.empty(ARRAYSIZE, dtype=numpy.int32)
    while True:
        print "[Worker %d] asking for work" % rank
        comm.Send([data, ARRAYSIZE, MPI.INT], dest=0, tag=ASK_FOR_WORK_TAG)
        print "[Worker %d] sent request for work" % rank

        comm.Recv([data, ARRAYSIZE, MPI.INT], source=0, tag=MPI.ANY_TAG, status=status)

        if status.Get_tag() == WORK_TAG:
            print "[Worker %d] got work" % rank

            print "[Worker %d] is sorting the array" % rank
            data.sort()
            print "[Worker %d] finished work. Sending it back" % rank
            comm.Send([data, ARRAYSIZE, MPI.INT], dest=0, tag=WORK_DONE_TAG)
        elif status.Get_tag() == DIE_TAG:
            print "[Worker %d] DIE DIE DIE" % rank
            break
        else:
            print "[Worker %d] Doesn't know what to to with tag %d right now" % (rank, status.Get_tag())