I have been working with MPI4py arrays, and recently came across a performance gain from using the Scatterv() function. I have written code that checks the data type of the input object and, if it is a numerical NumPy array, scatters it with Scatterv(); otherwise it scatters it with an appropriately implemented function.

The code is as follows:
import numpy as np
from mpi4py import MPI
import cProfile
from line_profiler import LineProfiler

def ScatterV(object, comm, root=0):
    optimize_scatter, object_type = np.zeros(1), None

    if rank == root:
        if isinstance(object, np.ndarray):
            if object.dtype in [np.float64, np.float32, np.float16, np.float,
                                np.int, np.int8, np.int16, np.int32, np.int64]:
                optimize_scatter = 1
                object_type = object.dtype
            else:
                optimize_scatter, object_type = 0, None
        else:
            optimize_scatter, object_type = 0, None

    optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()

    comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    object_type = comm.bcast(object_type, root=root)

    if int(optimize_scatter) == 1:
        if rank == root:
            displs = [int(i) * object.shape[1] for i in
                      np.linspace(0, object.shape[0], comm.size + 1)]
            counts = [displs[i + 1] - displs[i] for i in range(len(displs) - 1)]
            lens = [int((displs[i + 1] - displs[i]) / object.shape[1])
                    for i in range(len(displs) - 1)]
            displs = displs[:-1]
            shape = object.shape
            object = object.ravel().astype(np.float64, copy=False)
        else:
            object, counts, displs, shape, lens = None, None, None, None, None

        counts = comm.bcast(counts, root=root)
        displs = comm.bcast(displs, root=root)
        lens = comm.bcast(lens, root=root)
        shape = list(comm.bcast(shape, root=root))
        shape[0] = lens[rank]
        shape = tuple(shape)

        x = np.zeros(counts[rank])
        comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
        return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
    else:
        return comm.scatter(object, root=root)

comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()

if rank == 0:
    arra = (np.random.rand(10000000, 10) * 100).astype(np.float64, copy=False)
else:
    arra = None

lp = LineProfiler()
lp_wrapper = lp(ScatterV)
lp_wrapper(arra, comm)
if rank == 4:
    lp.print_stats()

pr = cProfile.Profile()
pr.enable()
f2 = ScatterV(arra, comm)
pr.disable()
if rank == 4:
    pr.print_stats()
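For reference, the counts/displs arithmetic in ScatterV can be checked in isolation, with no MPI involved. The following sketch repeats the same np.linspace-based splitting for a hypothetical (10, 3) array across 4 ranks (the sizes are made up for illustration):

```python
import numpy as np

rows, cols, nprocs = 10, 3, 4  # illustrative sizes, not from the post

# Same arithmetic as in ScatterV: split row indices with linspace,
# then convert to flat-element offsets and per-rank element counts.
displs = [int(i) * cols for i in np.linspace(0, rows, nprocs + 1)]
counts = [displs[i + 1] - displs[i] for i in range(len(displs) - 1)]
lens = [(displs[i + 1] - displs[i]) // cols for i in range(len(displs) - 1)]
displs = displs[:-1]

print(displs)  # flat-element offset where each rank's block starts
print(counts)  # flat elements sent to each rank
print(lens)    # rows received by each rank
```

Note that uneven splits are handled by the int() truncation of linspace, so the ranks may receive different numbers of rows (here 2, 3, 2 and 3), while the counts always sum to the total number of elements.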
Profiling with LineProfiler yields the following results (only the conflicting lines are shown):
Timer unit: 1e-06 s
Total time: 2.05001 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
41 1 1708453.0 1708453.0 83.3 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 148.0 148.0 0.0 object_type = comm.bcast(object_type, root=root)
...
76 1 264.0 264.0 0.0 counts = comm.bcast(counts, root=root)
77 1 16.0 16.0 0.0 displs = comm.bcast(displs, root=root)
78 1 14.0 14.0 0.0 lens = comm.bcast(lens, root=root)
79 1 9.0 9.0 0.0 shape = list(comm.bcast(shape, root=root))
...
86 1 340971.0 340971.0 16.6 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
Profiling with cProfile yields the following results:
17 function calls in 0.462 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.127 0.127 0.127 0.127 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 0.335 0.335 0.335 0.335 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In both cases, the Scatterv method takes a large amount of time compared with the Bcast method. With LineProfiler, the Bcast method is 5x slower than the Scatterv method, which seems completely incoherent to me, since Bcast only broadcasts an array of 10 elements.
If I swap lines 41 and 42, the results are as follows:
LineProfiler
41 1 1666718.0 1666718.0 83.0 object_type = comm.bcast(object_type, root=root)
42 1 47.0 47.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
87 1 341728.0 341728.0 17.0 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 0.000 0.000 0.000 0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 0.339 0.339 0.339 0.339 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.129 0.026 0.129 0.026 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
If I change the size of the array to be scattered, the time consumption of Scatterv and Bcast changes at the same rate as well. For instance, if I increase the size 10-fold (100000000), the results are:
LineProfiler
41 1 16304301.0 16304301.0 82.8 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 235.0 235.0 0.0 object_type = comm.bcast(object_type, root=root)
87 1 3393658.0 3393658.0 17.2 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 1.348 1.348 1.348 1.348 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 4.517 4.517 4.517 4.517 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
The same happens if, instead of selecting the results for rank 4, I select them for any rank greater than 1. For rank = 0, however, the results are different:
LineProfiler
41 1 186.0 186.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 244.0 244.0 0.0 object_type = comm.bcast(object_type, root=root)
87 1 4722349.0 4722349.0 100.0 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 0.000 0.000 0.000 0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 5.921 5.921 5.921 5.921 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In this case, the computation time of the Bcast method is similar to that of the remaining bcast calls.

I have also tried using Bcast and bcast instead of scatter on line 41, with the same results.

Given this, I assume that the inflated time consumption is being wrongly attributed to the first broadcast, which would mean that both profilers produce erroneous timings for the parallelized steps.

I am fairly sure the internals of the profilers were not designed with parallelized functions in mind, but I am posting this question to find out whether anyone has run into similar results.
Answer 0 (score: 0):
Answering Gilles Gouaillardet, I added comm.Barrier() calls on the lines before and after each bcast call, and most of the time turns out to be absorbed by those comm.Barrier() calls.

Here is an example with LineProfiler:
Timer unit: 1e-06 s
Total time: 2.17248 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26
Line # Hits Time Per Hit % Time Line Contents
==============================================================
26 def ScatterV(object, comm, root = 0):
27 1 7.0 7.0 0.0 optimize_scatter, object_type = np.zeros(1), None
28
29 1 2.0 2.0 0.0 if rank == root:
30 if isinstance(object, np.ndarray):
31 if object.dtype in [np.float64, np.float32, np.float16, np.float,
32 np.int, np.int8, np.int16, np.int32, np.int64]:
33 optimize_scatter = 1
34 object_type = object.dtype
35
36 else: optimize_scatter, object_type = 0, None
37 else: optimize_scatter, object_type = 0, None
38
39 optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()
40
41 1 1677662.0 1677662.0 77.2 comm.Barrier()
42 1 76.0 76.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
43 1 345.0 345.0 0.0 comm.Barrier()
44 1 111.0 111.0 0.0 object_type = comm.bcast(object_type, root=root)
45 1 166.0 166.0 0.0 comm.Barrier()
46
47
48
49 1 7.0 7.0 0.0 if int(optimize_scatter) == 1:
50
51 1 2.0 2.0 0.0 if rank == root:
52 if object.ndim > 1:
53 displs = [int(i)*object.shape[1] for i in
54 np.linspace(0, object.shape[0], comm.size + 1)]
55 else:
56 displs = [int(i) for i in np.linspace(0, object.shape[0], comm.size + 1)]
57
58 counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
59
60 if object.ndim > 1:
61 lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
62 for i in range(len(displs)-1)]
63 else:
64 lens = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
65
66 displs = displs[:-1]
67
68
69 shape = object.shape
70
71
72
73 if object.ndim > 1:
74 object = object.ravel().astype(np.float64, copy=False)
75
76
77 else:
78 1 2.0 2.0 0.0 object, counts, displs, shape, lens = None, None, None, None, None
79
80 1 295.0 295.0 0.0 counts = comm.bcast(counts, root=root)
81 1 66.0 66.0 0.0 displs = comm.bcast(displs, root=root)
82 1 6.0 6.0 0.0 lens = comm.bcast(lens, root=root)
83 1 9.0 9.0 0.0 shape = list(comm.bcast(shape, root=root))
84
85 1 2.0 2.0 0.0 shape[0] = lens[rank]
86 1 3.0 3.0 0.0 shape = tuple(shape)
87
88 1 33.0 33.0 0.0 x = np.zeros(counts[rank])
89
90 1 76.0 76.0 0.0 comm.Barrier()
91 1 351187.0 351187.0 16.2 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
92 1 142352.0 142352.0 6.6 comm.Barrier()
93
94 1 5.0 5.0 0.0 if len(shape) > 1:
95 1 66.0 66.0 0.0 return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
96 else:
97 return x.view(object_type)
98
99
100 else:
101 return comm.scatter(object, root=root)
The 77.2% of the time is spent on the first comm.Barrier() call, so I can safely assume that none of the bcast calls takes anywhere near that long. I will be including comm.Barrier() calls in future profilings.