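I want to split a 2D NumPy array into x parts and send them to x different processes via mpi4py. When I use the Scatterv function, I get a segmentation fault on task 1 for x = 2, whereas x = 4 or x = 8 works fine.
The NumPy array has shape (151789810, 9) and dtype float32, and I want to scatter it along axis 0. The node has plenty of memory (512 GB).
I am using:
Here is the small "scatter method" from my small "scatter class":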
def distribute(self):
    # self.data is the NumPy array with shape (151789810, 9) and dtype float32.

    # Broadcast the scatter parameters (calculated beforehand).
    self.displacements_input = self.comm.bcast(self.displacements_input, root=0)
    self.split_sizes_input = self.comm.bcast(self.split_sizes_input, root=0)
    self.split_shapes = self.comm.bcast(self.split_shapes, root=0)

    # Allocate the receive buffer for this rank.
    self.chunk = np.zeros(self.split_shapes[self.rank], dtype=np.float32)
    self.comm.barrier()

    # Print info.
    if self.rank == 0:
        print(self.data.shape)
        print(self.data.dtype)
        print("dis: ", self.displacements_input)
        print("sizes: ", self.split_sizes_input)
        print("shapes: ", self.split_shapes)
    print("Chunk of rank {} has shape {} and dtype {}".format(
        self.rank, self.chunk.shape, self.chunk.dtype))

    # Call Scatterv. This is where the segfault happens.
    self.comm.Scatterv(
        [self.data, self.split_sizes_input, self.displacements_input, self.MPI_obj.FLOAT],
        self.chunk,
        root=0)
Output with two processes:
#(151789810, 9)
#float32
#dis: [ 0 683054145]
#sizes: [683054145 683054145]
#shapes: [(75894905, 9), (75894905, 9)]
#Chunk of rank 0 has shape (75894905, 9) and dtype float32
#Chunk of rank 1 has shape (75894905, 9) and dtype float32
...and then I get:
#srun: error: nodename: task 0: Terminated
#srun: error: nodename: task 1: Segmentation fault
Output with four processes:
#(151789810, 9)
#float32
#dis: [ 0 341527077 683054154 1024581222]
#sizes: [341527077 341527077 341527068 341527068]
#shapes: [(37947453, 9), (37947453, 9), (37947452, 9), (37947452, 9)]
#Chunk of rank 0 has shape (37947453, 9) and dtype float32
#Chunk of rank 1 has shape (37947453, 9) and dtype float32
#Chunk of rank 2 has shape (37947452, 9) and dtype float32
#Chunk of rank 3 has shape (37947452, 9) and dtype float32
...and everything works fine.
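One thing that can be read off the printed sizes is how large each per-rank message is in bytes. The sketch below only redoes that arithmetic for the shape given above; the names `chunk_report` and `INT32_MAX` are made up for illustration. Whether a 32-bit byte count matters for the MPI library in use is an assumption, not something the output confirms.

```python
import numpy as np

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit C int

rows, cols = 151789810, 9                  # shape from the question
itemsize = np.dtype(np.float32).itemsize   # 4 bytes per element


def chunk_report(n_ranks):
    """Return (elements, bytes) of the largest chunk when rows are split over n_ranks."""
    base, extra = divmod(rows, n_ranks)
    largest_rows = base + (1 if extra else 0)
    elements = largest_rows * cols
    return elements, elements * itemsize


for n_ranks in (2, 4, 8):
    elements, nbytes = chunk_report(n_ranks)
    print(n_ranks, elements, nbytes,
          "elements fit in int32:", elements <= INT32_MAX,
          "bytes fit in int32:", nbytes <= INT32_MAX)
```

With two ranks the element count per chunk (683054145) still fits in a 32-bit int, but the byte count (2732216580) does not. With four or eight ranks both fit. That lines up with the observed behaviour, though it does not prove the cause.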
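So here is a small example that (hopefully) reproduces the error. The dimensions of the NumPy array are set by first_dim_n and second_dim_n. The number of processes can be changed by adjusting the srun arguments.
I know the program works for "small" arrays such as (1000, 9), but I am interested in big arrays in large memory. So please make sure you use a similar ratio. If it still works, the error could be anywhere...
For @GillesGouaillardet everything works fine. I am now looking for any useful debugging information/crash reports...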
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()


def get_split_sizes_and_displacements(arr, indices_or_sections, axis=0):
    # Split arr.shape[axis] into Nsections parts; the first `extras` parts
    # get one additional row.
    Ntotal = arr.shape[axis]
    Nsections = int(indices_or_sections)
    Neach_section, extras = divmod(Ntotal, Nsections)
    section_sizes = ([0] +
                     extras * [Neach_section + 1] +
                     (Nsections - extras) * [Neach_section])
    feature_size = arr.shape[1]
    # Convert row boundaries into element offsets of the flattened array.
    div_points = np.array(section_sizes).cumsum()
    div_points *= feature_size
    displacements = div_points[:-1]
    split_sizes = np.ediff1d(div_points)
    split_shapes = list(map(lambda x: (int(x // feature_size), feature_size), split_sizes))
    return split_sizes, displacements, split_shapes


if __name__ == '__main__':
    if rank == 0:
        first_dim_n = 151789810
        second_dim_n = 9
        data = np.random.rand(first_dim_n, second_dim_n).astype(np.float32)
        split_sizes, displacements, split_shapes = get_split_sizes_and_displacements(data, size)
    else:
        split_sizes = None
        displacements = None
        split_shapes = None
        data = None

    # Every rank needs the counts, displacements and target shapes.
    split_sizes_input = comm.bcast(split_sizes, root=0)
    displacements_input = comm.bcast(displacements, root=0)
    split_shapes_input = comm.bcast(split_shapes, root=0)
    comm.barrier()

    # Receive buffer for this rank, then the scatter itself.
    chunk = np.zeros(split_shapes_input[rank], dtype=np.float32)
    comm.Scatterv([data, split_sizes_input, displacements_input, MPI.FLOAT], chunk, root=0)

    if rank == 0:
        print(data.shape)
    print("rank {} has shape {}".format(rank, chunk.shape))
Output with two processes:
srun: error: nodename: task 0: Terminated
srun: error: nodename: task 1: Segmentation fault
Output with four processes:
(151789810, 9)
rank 0 has shape (37947453, 9)
rank 1 has shape (37947453, 9)
rank 2 has shape (37947452, 9)
rank 3 has shape (37947452, 9)
Everything works fine.
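The counts and displacements can also be checked without MPI on a scaled-down array. The following is a minimal sketch, not the reproducer itself: `split_params` and `simulate_scatterv` are hypothetical helpers that slice the flattened array the way the counts and displacements describe it, assuming C-ordered data.

```python
import numpy as np


def split_params(n_rows, n_cols, n_ranks):
    """Counts and displacements (in elements of the flat array) plus per-rank shapes."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks receive one additional row.
    rows_per_rank = np.array([base + 1] * extra + [base] * (n_ranks - extra),
                             dtype=np.int64)
    counts = rows_per_rank * n_cols
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))
    shapes = [(int(r), n_cols) for r in rows_per_rank]
    return counts, displs, shapes


def simulate_scatterv(data, n_ranks):
    """Cut the flattened array into the pieces each rank would receive."""
    flat = data.ravel()  # C order, one row after another
    counts, displs, shapes = split_params(data.shape[0], data.shape[1], n_ranks)
    return [flat[d:d + c].reshape(s) for c, d, s in zip(counts, displs, shapes)]


data = np.arange(10 * 9, dtype=np.float32).reshape(10, 9)
chunks = simulate_scatterv(data, 4)
print([c.shape for c in chunks])
# Stacking the chunks back together should give the original array.
print(np.array_equal(np.vstack(chunks), data))
```

This only exercises the index arithmetic. It says nothing about how the MPI library behaves at the full array size.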
So apparently I need to convince you that my code does what I want. Fine, I'll try.
Let's define an array of arbitrary shape:
(7, 2)
It could look like this:
a = array([[ 0, 11],
[13, 6],
[ 1, 9],
[ 3, 14],
[ 4, 8],
[ 9, 16],
[ 3, 17]])
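The first axis runs downward, the second to the right. We want to split the array into two parts along the first axis (downward). With 7 rows that is ambiguous, so we adopt the convention that lower ranks are filled first and higher ranks get the smaller remainder.
So rank 0 should get: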
array([[ 0, 11],
[13, 6],
[ 1, 9],
[ 3, 14]])
Rank 1 should get:
array([[ 4, 8],
[ 9, 16],
[ 3, 17]])
That works. At least on my machine. And it is what I want.
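For reference, `np.array_split` follows the same convention (the leading parts get the extra row), so it can serve as a quick cross-check of the expected chunks. A minimal sketch using the example array above:

```python
import numpy as np

a = np.array([[0, 11],
              [13, 6],
              [1, 9],
              [3, 14],
              [4, 8],
              [9, 16],
              [3, 17]])

# Split the 7 rows into two parts along axis 0.
rank0_part, rank1_part = np.array_split(a, 2, axis=0)
print(rank0_part.shape, rank1_part.shape)
print(rank1_part)
```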