I am sharing a large numpy array (write once, read many) between processes with mpi4py, using a shared-memory window. I find that I can set up the shared array without any problem; however, if I try to access the array on any process other than the leader, my memory usage grows beyond reasonable limits. I have a simple code snippet that illustrates the application here:
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
is_leader = shared_comm.rank == 0
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(0)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
if is_leader:
    _storedZModes[...] = np.ones(size)
shared_comm.Barrier()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                SUM = SUM + _storedZModes[i,j,k]
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
With the setup above, the shared array should take about 2.3 GB, which is confirmed when running the code and querying it. If I submit this through slurm to 4 cores on a single node, with 0.75 GB per process, it runs fine only if I do not do the sum. If I do the sum (as shown, or using np.sum or similar), slurm complains that the memory limit was exceeded. This does not happen if the leader rank does the sum.
With 0.75 GB per process, the total allocated memory is 3 GB, which leaves roughly 0.6 GB for everything other than the shared array. That should clearly be plenty.
It seems that accessing the memory on any process other than the leader copies the array, which obviously defeats the purpose. Am I doing something wrong?
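For reference, the expected array size is 45 x 2560 x 2560 x 8 bytes ≈ 2.36 GB, consistent with the ~2.3 GB observed. Below is a small Linux-only diagnostic (a sketch, not part of the original post; it assumes /proc/self/status exposes the VmRSS and RssShmem fields, the latter requiring a reasonably recent kernel) that can be called just before the time.sleep(500) to show whether the array is resident once or once per rank:

import re

def report_memory(comm, tag=""):
    # Read VmRSS (total resident memory) and RssShmem (resident shared memory)
    # from /proc/self/status; the kernel reports both in kB.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            m = re.match(r"(VmRSS|RssShmem):\s+(\d+)\s+kB", line)
            if m:
                fields[m.group(1)] = int(m.group(2)) / (1024.0 ** 2)  # kB -> GiB
    print("RANK", comm.Get_rank(), tag,
          "VmRSS = %.2f GiB," % fields.get("VmRSS", 0.0),
          "RssShmem = %.2f GiB" % fields.get("RssShmem", 0.0))

# e.g. report_memory(shared_comm, "after sum") placed just before time.sleep(500)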
EDIT
I have played with fences and with using Put/Get as below. I still get the same behaviour. If someone runs this and does not see the copy problem, that would still be useful information for me :)
from mpi4py import MPI
import numpy as np
import time
import sys
shared_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
print("Shared comm contains: ", shared_comm.Get_size(), " processes")
shared_comm.Barrier()
leader_rank = 0
is_leader = shared_comm.rank == leader_rank
# Set up a large array as example
_nModes = 45
_nSamples = 512*5
float_size = MPI.DOUBLE.Get_size()
print("COMM has ", shared_comm.Get_size(), " processes")
size = (_nModes, _nSamples, _nSamples)
if is_leader:
    total_size = np.prod(size)
    nbytes = total_size * float_size
    print("Expected array size is ", nbytes/(1024.**3), " GB")
else:
    nbytes = 0
# Create the shared memory, or get a handle based on shared communicator
shared_comm.Barrier()
win = MPI.Win.Allocate_shared(nbytes, float_size, comm=shared_comm)
# Construct the array
buf, itemsize = win.Shared_query(leader_rank)
_storedZModes = np.ndarray(buffer=buf, dtype='d', shape=size)
# Fill the shared array with only the leader rank
win.Fence()
if is_leader:
    print("RANK: ", shared_comm.Get_rank() , " is filling the array ")
    #_storedZModes[...] = np.ones(size)
    win.Put(np.ones(size), leader_rank, 0)
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY filled the array ")
    print("Sum should return ", np.prod(size))
win.Fence()
# Access the array - if we don't do this, then memory usage is as expected. If I do this, then I find that memory usage goes up to twice the size, as if it's copying the array on access
if shared_comm.rank == 1:
    print("RANK: ", shared_comm.Get_rank() , " is querying the array "); sys.stdout.flush()
    # Do a (bad) explicit sum to make clear it is not a copy problem within numpy sum()
    SUM = 0.
    counter = -1; tSUM = np.empty((1,))
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                if counter%10000 == 0:
                    print("Finished iteration: ", counter); sys.stdout.flush()
                counter += 1; win.Get(tSUM, leader_rank, counter); SUM += tSUM[0];
                #SUM = SUM + _storedZModes[i,j,k]
    print("RANK: ", shared_comm.Get_rank() , " SUCCESSFULLY queried the array ", SUM)
shared_comm.Barrier()
# Wait for a while to make sure slurm notices any issues before finishing
time.sleep(500)
ANSWER
Further investigation showed clearly that the problem lay with slurm: a switch that effectively tells slurm to ignore shared memory had been turned off, and turning it back on solved the problem.
A description of why this causes a problem is given in the accepted answer below. Essentially, slurm was counting the total resident memory of both processes.
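The post does not say which switch it was. Purely as an assumption: on clusters using the standard jobacct_gather/linux accounting plugin, the knobs that control whether shared pages are counted against a job are the JobAcctGatherParams options in slurm.conf, along these lines:

# slurm.conf (hypothetical excerpt; consult your site's configuration)
# UsePss   - account the proportional set size rather than raw RSS
# NoShared - exclude shared memory pages from the accounted RSS
JobAcctGatherParams=UsePss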
Answer (score: 0)
I ran this with two MPI tasks and monitored both of them with top and pmap.
These tools show that
_storedZModes[...] = np.ones(size)
allocates a temporary buffer full of 1, so the memory required on the leader is indeed 2 * nbytes (resident memory is 2 * nbytes, of which nbytes is in shared memory).
From top:
top - 15:14:54 up 43 min, 4 users, load average: 2.76, 1.46, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.5 us, 6.2 sy, 0.0 ni, 66.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 161624 free, 2324936 used, 1394464 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 1258976 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:00.39 python
6389 gilles 20 0 3477268 2.5g 1.1g D 12.3 68.1 0:02.41 python
Once this operation completes, the buffer full of 1 is freed and the memory drops to nbytes (resident memory ~= shared memory).
Note that at that point, the resident and shared memory of task 1 are both very small.
top - 15:14:57 up 43 min, 4 users, load average: 2.69, 1.47, 1.18
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.2 us, 1.3 sy, 0.0 ni, 71.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 3881024 total, 1621860 free, 848848 used, 1410316 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2735168 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6390 gilles 20 0 2002696 20580 7180 R 100.0 0.5 1:03.39 python
6389 gilles 20 0 2002704 1.1g 1.1g S 2.0 30.5 0:02.47 python
After the sum is computed on task 1, both resident and shared memory on that task increase to nbytes.
top - 15:18:09 up 46 min, 4 users, load average: 0.33, 1.01, 1.06
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.4 us, 2.9 sy, 0.0 ni, 88.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881024 total, 1297172 free, 854460 used, 1729392 buff/cache
KiB Swap: 839676 total, 818172 free, 21504 used. 2729768 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6389 gilles 20 0 2002704 1.1g 1.1g S 0.0 30.6 0:02.48 python
6390 gilles 20 0 2002700 1.4g 1.4g S 0.0 38.5 2:34.42 python
At the end, top reports two processes each with roughly nbytes of resident memory, but this is essentially a single mapping of the same nbytes of shared memory.
I do not know how SLURM measures memory consumption...
If it correctly accounts for shared memory, then it should be fine (e.g. nbytes allocated in total). However, if it ignores the sharing, it will consider that your job uses 2 * nbytes of (resident) memory, which might be too much.
Note that if you replace the initialization with

if is_leader:
    for i in range(_nModes):
        for j in range(_nSamples):
            for k in range(_nSamples):
                _storedZModes[i,j,k] = 1

no temporary buffer filled with 1 is allocated, and the maximum memory consumption on rank 0 is nbytes instead of 2 * nbytes.
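As a side note not taken from the answer itself: assigning a scalar (or using ndarray.fill) also writes directly into the shared window buffer, so it avoids both the temporary created by np.ones(size) and the slow explicit Python loops:

if is_leader:
    # Scalar broadcast writes in place into the shared buffer;
    # no second nbytes-sized temporary is created.
    _storedZModes[...] = 1.0
    # equivalently: _storedZModes.fill(1.0)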