Question

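I want to use multiprocessing where one of the arguments is a very large numpy array. I've looked at some other posts that appear to describe similar issues: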

Large numpy arrays in shared memory for multiprocessing: Is sth wrong with this approach?

Share Large, Read-Only Numpy Array Between Multiprocessing Processes

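but, being rather new to Python, I've had trouble adapting those solutions to this template. Could you help me understand what my options are for passing X to the functions in a read-only manner? My simplified code snippet is here: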

import multiprocessing as mp
import numpy as np

def funcA(X):
    # do something with X
    print('funcA OK')

def funcB(X):
    # do something else with X
    print('funcB OK')

if __name__=='__main__':
    X=np.random.rand(int(5.00e8),)
    funcA(X) # OK
    funcB(X) # OK
    X=np.random.rand(int(2.65e8),)
    P=[]
    P.append(mp.Process(target=funcA,args=(X,))) # OK
    P.append(mp.Process(target=funcB,args=(X,))) # OK
    for p in P:
        p.start()

    for p in P:
        p.join()

    X=np.random.rand(int(2.70e8),)
    P=[]
    P.append(mp.Process(target=funcA,args=(X,))) # FAIL
    P.append(mp.Process(target=funcB,args=(X,))) # FAIL
    for p in P:
        p.start()

    for p in P:
        p.join()

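When called sequentially, funcA and funcB appear to accept very large numpy arrays. However, when they are run as separate processes, there seems to be an upper limit on the size of the numpy array that can be passed to the function. How can I best get around this?

Notes:

0) I do not want to modify X; I only read from it;

1) I'm running 64-bit Windows 7 Professional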

Answer 1

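The problem may lie in the data transfer to the child processes. When read-only objects must be used, I prefer to exploit the copy-on-write mechanism that the underlying OS uses to manage the memory of child processes. However, I don't know whether Windows 7 uses this mechanism. When copy-on-write is available, the children can access memory areas of the parent process without copying them. This trick only works if the objects are accessed in a read-only way and are created before the processes are created.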
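A quick size check is consistent with a data-transfer problem. np.random.rand returns float64 values (8 bytes each), so the array that works and the array that fails in the question fall on opposite sides of 2 GiB. Reading that boundary as a limit hit while the arguments are sent to the child is an assumption, not something confirmed in this thread, and the helper name array_nbytes is made up for illustration.

```python
import numpy as np

TWO_GIB = 2**31  # 2,147,483,648 bytes


def array_nbytes(n_elements, dtype=np.float64):
    """Bytes needed for a 1-D array of n_elements items of the given dtype."""
    return n_elements * np.dtype(dtype).itemsize


# the two sizes from the question: the first works, the second fails
for n in (int(2.65e8), int(2.70e8)):
    nbytes = array_nbytes(n)
    side = "below" if nbytes < TWO_GIB else "above"
    print("{} elements -> {} bytes ({} 2 GiB)".format(n, nbytes, side))
```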

To sum up, a possible solution (at least on Linux machines) is this:

import multiprocessing as mp
import numpy as np

def funcA():
    print("A {}".format(X.shape))
    # do something with the global variable X
    print('funcA OK')

def funcB():
    print("B {}".format(X.shape))
    # do something else with the global variable X
    print('funcB OK')

if __name__=='__main__':
    X=np.random.rand(int(5.00e8),)
    funcA() # OK
    funcB() # OK

    X=np.random.rand(int(2.65e8),)
    P=[mp.Process(target=funcA), mp.Process(target=funcB)]
    for p in P:
        p.start()

    for p in P:
        p.join()

    X=np.random.rand(int(2.70e8),)
    P=[mp.Process(target=funcA), mp.Process(target=funcB)]
    for p in P:
        p.start()

    for p in P:
        p.join()
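Whether the global-variable approach avoids a copy depends on how multiprocessing starts the children. With the fork start method the child inherits the parent's memory pages copy-on-write. With spawn, which is the only method available on Windows, the child re-imports the module, the `if __name__=='__main__'` block does not run there, and the global X is never defined. A minimal sketch for checking this at runtime, assuming Python 3.4 or later (where get_start_method exists); the helper name inherits_parent_memory is hypothetical:

```python
import multiprocessing as mp


def inherits_parent_memory(method=None):
    """Return True when children are forked and so see the parent's globals."""
    if method is None:
        method = mp.get_start_method()
    # only a plain fork copies the parent's address space into the child
    return method == "fork"


if __name__ == "__main__":
    print("start method:", mp.get_start_method())
    print("globals visible in children:", inherits_parent_memory())
```

If this reports False, the array has to reach the children some other way, for example through a file-backed memory map.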

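UPDATE: after various comments about compatibility issues on Windows, I worked out a new solution based solely on native memory maps. In this solution I create a numpy memory map on a file, which is shared through the file descriptor, so the whole array does not need to be copied inside the children. I found this solution to be much faster than using multiprocessing.Array!

UPDATE2: the code below has been updated to avoid memory issues while filling the memory map with random values.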

import multiprocessing as mp
import numpy as np
import tempfile

def funcA(X):
    print("A {}".format(X.shape))
    # do something with X
    print('funcA OK')

def funcB(X):
    print("B {}".format(X.shape))
    # do something else with X
    print('funcB OK')

if __name__=='__main__':
    dim = int(2.75e8)
    # no dir argument: the default temp directory is valid on any OS ('/tmp' is not)
    with tempfile.NamedTemporaryFile(delete=False) as tmpfile:
        X = np.memmap(tmpfile, shape=dim, dtype=np.float32)  # create the memory map
        # init the map by chunks of size 1e8
        max_chunk_size = int(1e8)
        for start_pos in range(0, dim, max_chunk_size):
            chunk_size = min(dim-start_pos, max_chunk_size)
            X[start_pos:start_pos+chunk_size] = np.random.rand(chunk_size,)
        P=[mp.Process(target=funcA, args=(X,)), mp.Process(target=funcB, args=(X,))]
        for p in P:
            p.start()

        for p in P:
            p.join()
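One caveat for platforms that spawn rather than fork: the args are pickled on their way to the child, and as far as I know a pickled np.memmap travels as ordinary array data rather than as a mapping. A variation that sidesteps this is to pass only the file path and reopen the map read-only in each child. This is a sketch under assumptions, not a tested fix for the original setup: the helper names create_memmap_file and open_readonly are hypothetical, and the shape and dtype have to be passed along with the path because the raw file does not store them.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def create_memmap_file(n, dtype="float32", chunk=int(1e6)):
    """Create a file-backed array of n random values and return its path."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    os.close(fd)
    X = np.memmap(path, dtype=dtype, mode="w+", shape=(n,))
    # fill by chunks so the random data never has to fit in memory at once
    for start in range(0, n, chunk):
        stop = min(n, start + chunk)
        X[start:stop] = np.random.rand(stop - start)
    X.flush()
    del X  # release the mapping before other processes open the file
    return path


def open_readonly(path, n, dtype="float32"):
    """Reopen the file as a read-only memory map."""
    return np.memmap(path, dtype=dtype, mode="r", shape=(n,))


def worker(path, n, dtype, out_queue):
    X = open_readonly(path, n, dtype)  # maps the file, does not load it
    out_queue.put(float(X.sum()))


if __name__ == "__main__":
    n = int(1e5)
    path = create_memmap_file(n)
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(path, n, "float32", queue))
             for _ in range(2)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    print(results)
    os.remove(path)
```

Opening with mode="r" also enforces the read-only requirement from the question: an accidental write in a child raises an error instead of silently modifying the file.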

Multiprocessing with a large numpy array as an argument
