I am trying to parallelise some operations on a large numpy array using mpi4py. I am currently using numpy.array_split to divide the array into chunks, then comm.scatter to send the chunks to the different cores, and finally comm.gather to collect the resulting arrays. A minimal (non-)working example is below:
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    test = np.random.rand(411, 48, 52, 40)
    test_chunks = np.array_split(test, size, axis=0)
else:
    test_chunks = None

# Scatter one chunk of frames to each core.
test_chunk = comm.scatter(test_chunks, root=0)
output_chunk = np.zeros([np.shape(test_chunk)[0], 128, 128, 128])

# Embed each input frame in a larger, zero-padded output frame.
for i in range(0, np.shape(test_chunk)[0], 1):
    print(i)
    output_chunk[i, 0:48, 0:52, 0:40] = test_chunk[i]

# Collect the processed chunks back on the root.
outputData = comm.gather(output_chunk, root=0)
if rank == 0:
    outputData = np.concatenate(outputData, axis=0)
Running this gives me the following error:
File "test_4d.py", line 23, in <module>
outputData = comm.gather(output_chunk,root=0)
File "Comm.pyx", line 869, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:73266)
File "pickled.pxi", line 614, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:33592)
File "pickled.pxi", line 146, in mpi4py.MPI._p_Pickle.allocv (src/mpi4py.MPI.c:28517)
File "pickled.pxi", line 95, in mpi4py.MPI._p_Pickle.alloc (src/mpi4py.MPI.c:27832)
SystemError: Negative size passed to PyString_FromStringAndSize
This error appears to be caused by the large size of the numpy arrays being collected by gather; since scatter and gather send the arrays as a list of numpy arrays, it seems easy to exceed the list size. One suggestion I came across is to use comm.Scatter and comm.Gather. However, I am struggling to find clear documentation for these functions and so far have not been able to implement them successfully. For example:
Replacing the line
outputData = comm.gather(output_chunk,root=0)
with the line
outputData=comm.Gather(sendbuf[test_chunks,MPI.DOUBLE],recvbuf=output_chunk,MPI.DOUBLE],root=0)
gives the error:
File "Comm.pyx", line 415, in mpi4py.MPI.Comm.Gather (src/mpi4py.MPI.c:66916)
File "message.pxi", line 426, in mpi4py.MPI._p_msg_cco.for_gather (src/mpi4py.MPI.c:23559)
File "message.pxi", line 355, in mpi4py.MPI._p_msg_cco.for_cco_send (src/mpi4py.MPI.c:22959)
File "message.pxi", line 111, in mpi4py.MPI.message_simple (src/mpi4py.MPI.c:20516)
File "message.pxi", line 51, in mpi4py.MPI.message_basic (src/mpi4py.MPI.c:19644)
File "asbuffer.pxi", line 108, in mpi4py.MPI.getbuffer (src/mpi4py.MPI.c:6757)
File "asbuffer.pxi", line 50, in mpi4py.MPI.PyObject_GetBufferEx (src/mpi4py.MPI.c:6093)
TypeError: expected a readable buffer object
Alternatively, using the line:
outputData = comm.Gather(sendbuf=test_chunks, recvbuf=output_chunk,root=0)
gives the error:
File "test_4d_2.py", line 24, in <module>
outputData = comm.Gather(sendbuf=test_chunks, recvbuf=output_chunk,root=0)
File "Comm.pyx", line 415, in mpi4py.MPI.Comm.Gather (src/mpi4py.MPI.c:66916)
File "message.pxi", line 426, in mpi4py.MPI._p_msg_cco.for_gather (src/mpi4py.MPI.c:23559)
File "message.pxi", line 355, in mpi4py.MPI._p_msg_cco.for_cco_send (src/mpi4py.MPI.c:22959)
File "message.pxi", line 111, in mpi4py.MPI.message_simple (src/mpi4py.MPI.c:20516)
File "message.pxi", line 60, in mpi4py.MPI.message_basic (src/mpi4py.MPI.c:19747)
TypeError: unhashable type: 'numpy.ndarray'
Furthermore, the input matrix test may also grow in size, which is likely to cause similar problems for comm.scatter. In addition to the problems already present with comm.Gather, I do not know how to set up comm.Scatter, since recvbuf is defined based on the size of test_chunk, which is the output of comm.scatter, and so I cannot specify recvbuf within comm.Scatter.
Answer 0 (score: 1)
The solution is to use comm.Scatterv and comm.Gatherv, which send and receive the data as a single block of memory rather than as a list of numpy arrays, getting around the data size problem. comm.Scatterv and comm.Gatherv assume a C-order (row-major) block of data in memory, and two vectors must be specified: sendcounts and displacements. sendcounts gives the number of elements to send to each core, while displacements gives the index (offset) in the flattened array at which each core's block starts. This makes it possible to vary the amount of data sent to each core. For example, scattering a flattened array of 100 doubles evenly across 4 cores would use sendcounts = [25, 25, 25, 25] and displacements = [0, 25, 50, 75]. More details can be found here: http://materials.jeremybejarano.com/MPIwithPython/collectiveCom.html
An example of using comm.Scatterv and comm.Gatherv for a 2D matrix is given here:
Along what axis does mpi4py Scatterv function split a numpy array?
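Below is a rough sketch of what a Scatterv/Gatherv version of the code in the question might look like. It is a sketch rather than a tested solution: the helper names (frames, in_frame, out_frame, in_counts, out_counts and the displacement arithmetic) are illustrative and not from the original code, and note that at the question's array sizes the gathered output on the root would be several GB.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

# Per-frame shapes from the question: inputs are (48, 52, 40), outputs
# are (128, 128, 128), with 411 frames in total along axis 0.
n_total = 411
in_frame = (48, 52, 40)
out_frame = (128, 128, 128)

# Number of frames (along axis 0) assigned to each rank.
frames = [n_total // size + (1 if r < n_total % size else 0) for r in range(size)]

# Element counts and displacements for the flattened, C-ordered buffers.
in_counts = [f * int(np.prod(in_frame)) for f in frames]
in_displs = [sum(in_counts[:r]) for r in range(size)]
out_counts = [f * int(np.prod(out_frame)) for f in frames]
out_displs = [sum(out_counts[:r]) for r in range(size)]

test = np.random.rand(n_total, *in_frame) if rank == 0 else None

# Scatter the input: each rank receives frames[rank] contiguous frames.
test_chunk = np.empty((frames[rank],) + in_frame, dtype='d')
comm.Scatterv([test, in_counts, in_displs, MPI.DOUBLE], test_chunk, root=0)

# Per-frame work from the question: embed each frame in a larger zero array.
output_chunk = np.zeros((frames[rank],) + out_frame, dtype='d')
output_chunk[:, :48, :52, :40] = test_chunk

# Gather the processed chunks back into one contiguous array on the root.
outputData = np.empty((n_total,) + out_frame, dtype='d') if rank == 0 else None
comm.Gatherv(output_chunk, [outputData, out_counts, out_displs, MPI.DOUBLE], root=0)

With this approach each chunk travels as one contiguous buffer of doubles, so the pickle-based size limit hit by the lowercase scatter/gather no longer applies, and the recvbuf shapes can be computed up front from the counts rather than taken from the output of comm.scatter.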