mpi4py shared memory - memory usage spikes on access

Date: 2018-08-08 13:01:05

Tags: python mpi shared-memory mpi4py

I am using shared memory to share a large numpy array (write-once, read-many) with mpi4py, via a shared window. I find that I can set up the shared array without problems; however, if I try to access the array on any process other than the leader, my memory usage grows beyond reasonable limits. I have a simple code snippet that illustrates the application here:

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)

is_leader = shared_comm.rank == 0

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0

# Create the shared memory, or get a handle to it, based on the shared communicator
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array: every rank queries the leader's segment and wraps the
# same underlying buffer, so no data should be copied here
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)

shared_comm.Barrier()

# Access the array - if we don't do this, then memory usage is as expected.
# If I do this, then memory usage goes up to twice the size, as if the array were copied on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)

With the setup above, the shared array should take about 2.3GB, which is confirmed by running the code and querying it. If I submit through slurm to 4 cores on a single node, with 0.75GB per process, it runs fine only if I do not perform the sum. If the sum is performed (as shown, or using np.sum or similar), slurm complains that the memory usage has been exceeded. This does not happen if the leader rank does the sum.

With 0.75GB per process, the total memory allocated is 3GB, which leaves roughly 0.6GB for everything other than the shared array. That should clearly be plenty.
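For concreteness, the submission corresponds to an sbatch script roughly like the following (job and script names here are placeholders):

#!/bin/bash
#SBATCH --job-name=shared_win_test   # placeholder job name
#SBATCH --nodes=1                    # single node, so all ranks can share memory
#SBATCH --ntasks=4                   # 4 MPI processes
#SBATCH --mem-per-cpu=768M           # 0.75GB per process
srun python shared_win_test.py       # placeholder script name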

It seems that accessing the memory on any process other than the leader copies the memory, which defeats the purpose. Am I doing something wrong?

EDIT

I have played with fences and used Put/Get as below. I still get the same behaviour. If someone runs this and does not see the copying problem, that would still be useful information for me :)

from mpi4py import MPI
import numpy as np
import time
import sys

shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")

shared_comm.Barrier()

leader_rank = 0
is_leader = shared_comm.rank == leader_rank

# Set up a large array as example
_nModes = 45
_nSamples = 512*5

float_size = MPI.DOUBLE.Get_size()

print("COMM has ", shared_comm.Get_size(), " processes")

size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0

# Create the shared memory, or get a handle to it, based on the shared communicator

shared_comm.Barrier()
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array

buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)

# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank() , " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY filled the array ")
    print("Sum should return ", np.prod(size))
win.Fence()

# Access the array - if we don't do this, then memory usage is as expected.
# If I do this, then memory usage goes up to twice the size, as if the array were copied on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank() , " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1
    tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter % 10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1
                win.Get(tSUM, leader_rank, counter)
                SUM += tSUM[0]
                #SUM = SUM + _storedZModes[i,j,k]

    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY queried the array ", SUM)

shared_comm.Barrier()

# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)

ANSWER

Further investigation made it clear that the problem was with slurm: a switch that effectively tells slurm to ignore shared memory was turned off, and turning it on solved the problem.

A description of why this caused the problem is given in the accepted answer. Essentially, slurm was counting the total resident memory of both processes.
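For reference, assuming the cluster uses slurm's Linux job accounting plugin, the relevant kind of switch is the JobAcctGatherParams option in slurm.conf; its NoShared flag excludes shared memory from the accounting (UsePss, which splits shared pages proportionally among the processes mapping them, would also avoid the double count). A sketch of such a configuration:

# slurm.conf - sketch of an accounting setup consistent with the fix above
JobAcctGatherType=jobacct_gather/linux
# NoShared: do not count shared memory pages towards a job's resident usage
JobAcctGatherParams=NoShared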

1 Answer:

Answer 0 (score: 0)

I ran this with two MPI tasks and monitored both of them with top and pmap.
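Assuming the PIDs shown in the snapshots below, the monitoring boils down to something like:

top -p 6389 -p 6390          # live RES and SHR columns for both ranks
pmap -x 6389 | tail -n 2     # per-mapping resident sizes, including the shared window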

These tools show that

_storedZModes[...] = np.ones(size)

allocates a buffer full of ones, so the memory needed by the leader is indeed 2 * nbytes (the resident memory is 2 * nbytes, of which nbytes is in shared memory).

From top:

top - 15:14:54 up 43 min,  4 users,  load average: 2.76, 1.46, 1.18
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.5 us,  6.2 sy,  0.0 ni, 66.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  3881024 total,   161624 free,  2324936 used,  1394464 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  1258976 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:00.39 python
 6389 gilles    20   0 3477268   2.5g   1.1g D  12.3 68.1   0:02.41 python

Once this operation completes, the buffer full of ones is freed and the memory drops to nbytes (resident memory ~= shared memory).

Note that at this point, both the resident and shared memory of task 1 are still tiny.

top - 15:14:57 up 43 min,  4 users,  load average: 2.69, 1.47, 1.18
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.2 us,  1.3 sy,  0.0 ni, 71.3 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  3881024 total,  1621860 free,   848848 used,  1410316 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  2735168 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6390 gilles    20   0 2002696  20580   7180 R 100.0  0.5   1:03.39 python
 6389 gilles    20   0 2002704   1.1g   1.1g S   2.0 30.5   0:02.47 python

After the sum is computed on task 1, both its resident and shared memory increase to nbytes.

top - 15:18:09 up 46 min,  4 users,  load average: 0.33, 1.01, 1.06
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.4 us,  2.9 sy,  0.0 ni, 88.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3881024 total,  1297172 free,   854460 used,  1729392 buff/cache
KiB Swap:   839676 total,   818172 free,    21504 used.  2729768 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6389 gilles    20   0 2002704   1.1g   1.1g S   0.0 30.6   0:02.48 python
 6390 gilles    20   0 2002700   1.4g   1.4g S   0.0 38.5   2:34.42 python

At the end, top reports two processes with roughly nbytes of resident memory each, which is really a single mapping of the same nbytes in shared memory.

I do not know how SLURM measures memory consumption... If it correctly accounts for shared memory, then it should be fine (e.g. nbytes allocated). But if it ignores it, it will consider that your job allocated 2 * nbytes of (resident) memory, which might be too much.
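To check this from inside the job, here is a minimal Linux-only sketch (it assumes /proc/<pid>/smaps_rollup is available, i.e. kernel 4.14+) that reads both RSS and PSS for the current process; PSS divides each shared page by the number of processes mapping it, so summing PSS over the ranks counts the shared array only once:

def rss_and_pss_kb(pid="self"):
    # Parse resident (Rss) and proportional (Pss) set sizes, in KiB
    rss = pss = 0
    with open("/proc/%s/smaps_rollup" % pid) as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])
    return rss, pss

# After the sum on rank 1, RSS includes the whole shared mapping on every
# rank, while the PSS values of all ranks add up to roughly one copy.
print(rss_and_pss_kb())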

Note that if you replace the initialization with

if is_leader:
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                _storedZModes[i,j,k] = 1

then no temporary buffer full of ones is allocated, and the peak memory consumption on rank 0 is nbytes instead of 2 * nbytes.
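An equivalent way to avoid the temporary, without the slow Python-level triple loop, is to fill the existing buffer in place:

if is_leader:
    _storedZModes.fill(1.0)        # writes directly into the shared buffer
    # broadcasting a scalar also avoids the temporary:
    # _storedZModes[...] = 1.0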