Using a numpy array in shared memory for multiprocessing

Time: 2011-10-25 19:34:32

Tags: python numpy multiprocessing shared

I would like to use a numpy array in shared memory for use with the multiprocessing module. The difficulty is using it like a numpy array, and not just as a ctypes array.

from multiprocessing import Process, Array
import numpy as np

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = np.random.rand(N)
    arr = Array('d', unshared_arr)
    print("Originally, the first two elements of arr = %s" % (arr[:2]))

    # Create, start, and finish the child process
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()

    # Print out the changed values
    print("Now, the first two elements of arr = %s" % arr[:2])

This produces output such as:

Originally, the first two elements of arr = [0.3518653236697369, 0.517794725524976]
Now, the first two elements of arr = [-0.3518653236697369, 0.517794725524976]

The array can be accessed in a ctypes manner, e.g. arr[i] makes sense. However, it is not a numpy array, and I cannot perform operations such as -1*arr or arr.sum(). I suppose a solution would be to convert the ctypes array into a numpy array. However (besides not being able to make this work), I don't believe it would be shared anymore.

It seems there would be a standard solution to what must be a common problem.

5 Answers:

Answer 0 (score: 70):

Adding to the answers from @unutbu (not available anymore) and @Henry Gomersall: you could use shared_arr.get_lock() to synchronize access when needed:

shared_arr = mp.Array(ctypes.c_double, N)
# ...
def f(i): # i can be anything numpy accepts as an index, such as another numpy array
    with shared_arr.get_lock(): # synchronize access
        arr = np.frombuffer(shared_arr.get_obj()) # no data copying
        arr[i] = -arr[i]

Example:

import ctypes
import logging
import multiprocessing as mp

from contextlib import closing

import numpy as np

info = mp.get_logger().info

def main():
    logger = mp.log_to_stderr()
    logger.setLevel(logging.INFO)

    # create shared array
    N, M = 100, 11
    shared_arr = mp.Array(ctypes.c_double, N)
    arr = tonumpyarray(shared_arr)

    # fill with random values
    arr[:] = np.random.uniform(size=N)
    arr_orig = arr.copy()

    # write to arr from different processes
    with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
        # many processes access the same slice
        stop_f = N // 10
        p.map_async(f, [slice(stop_f)]*M)

        # many processes access different slices of the same array
        assert M % 2 # odd
        step = N // 10
        p.map_async(g, [slice(i, i + step) for i in range(stop_f, N, step)])
    p.join()
    assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)

def init(shared_arr_):
    global shared_arr
    shared_arr = shared_arr_ # must be inherited, not passed as an argument

def tonumpyarray(mp_arr):
    return np.frombuffer(mp_arr.get_obj())

def f(i):
    """synchronized."""
    with shared_arr.get_lock(): # synchronize access
        g(i)

def g(i):
    """no synchronization."""
    info("start %s" % (i,))
    arr = tonumpyarray(shared_arr)
    arr[i] = -1 * arr[i]
    info("end   %s" % (i,))

if __name__ == '__main__':
    mp.freeze_support()
    main()

If you don't need synchronized access, or if you create your own locks, then mp.Array() is unnecessary. In that case you could use mp.sharedctypes.RawArray.
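For instance, a minimal lock-free sketch (assuming each worker writes to a distinct index, so no synchronization is needed; init and negate are illustrative names):

import ctypes
import multiprocessing as mp
import numpy as np
from multiprocessing.sharedctypes import RawArray

def init(raw_arr_):
    global raw_arr
    raw_arr = raw_arr_  # inherited by each worker at pool creation

def negate(i):
    # RawArray carries no lock and no get_obj(); it exposes its buffer directly
    arr = np.frombuffer(raw_arr, dtype=np.float64)
    arr[i] = -arr[i]  # safe only because each index is written by one task

if __name__ == '__main__':
    N = 100
    raw_arr = RawArray(ctypes.c_double, N)
    arr = np.frombuffer(raw_arr, dtype=np.float64)  # parent's view, no copy
    arr[:] = np.random.uniform(size=N)
    arr_orig = arr.copy()

    with mp.Pool(initializer=init, initargs=(raw_arr,)) as p:
        p.map(negate, range(N))

    # the parent's view reflects the workers' writes to shared memory
    assert np.allclose(arr, -arr_orig)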

Answer 1 (score: 17):

The Array object has a get_obj() method associated with it, which returns the ctypes array that presents a buffer interface. I think the following should work...

from multiprocessing import Process, Array
import numpy

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = numpy.random.rand(N)
    a = Array('d', unshared_arr)
    print("Originally, the first two elements of arr = %s" % (a[:2]))

    # Create, start, and finish the child process
    p = Process(target=f, args=(a,))
    p.start()
    p.join()

    # Print out the changed values
    print("Now, the first two elements of arr = %s" % a[:2])

    # Wrap the underlying ctypes array in a numpy view; no data is copied
    b = numpy.frombuffer(a.get_obj())

    b[0] = 10.0
    print(a[0])

When run, this prints out the first element of a, which is now 10.0, showing that a and b are just two views into the same memory.

To make sure it is still multiprocess-safe, I believe you will have to use the acquire and release methods that exist on the Array object a, and its built-in lock, to make sure it is all safely accessed (though I am not an expert on the multiprocessing module).
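For instance, a small sketch of guarded access (reusing a from the snippet above; with a.get_lock(): is an equivalent, more idiomatic form):

a.acquire()  # take the lock that the synchronized Array carries
try:
    b = numpy.frombuffer(a.get_obj())
    b[0] = -b[0]  # no other process can write while the lock is held
finally:
    a.release()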

Answer 2 (score: 12):

While the answers already given are good, there is a much easier way to solve this problem, provided two conditions are met:

  1. You are on a POSIX-compliant operating system (e.g. Linux, Mac OS X); and
  2. Your child processes need read-only access to the shared array.

In this case you do not need to fiddle with explicitly making variables shared, because the child processes will be created using a fork. A forked child automatically shares the parent's memory space. In the context of Python multiprocessing, this means it shares all module-level variables; note that this does not hold for arguments that you explicitly pass to the child processes or to the functions you call on a multiprocessing.Pool or so.
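A simple example (a minimal sketch, assuming the fork start method; the names data_array, job_handler, and launch_jobs are illustrative):

import multiprocessing
import numpy as np

# Module-level variable: implicitly shared with forked child processes
data_array = None

def job_handler(num):
    # id() returns the object's memory address in CPython; if the array
    # were copied rather than shared, each worker would report a new id
    return id(data_array), np.sum(data_array)

def launch_jobs(data, num_jobs=5, num_workers=4):
    global data_array
    data_array = data  # set before forking so the children inherit it

    pool = multiprocessing.Pool(num_workers)
    return pool.map(job_handler, range(num_jobs))

if __name__ == '__main__':
    mem_ids, sums = zip(*launch_jobs(np.random.rand(10)))

    # On a POSIX OS this prints True: all workers saw the same memory block
    print(np.all(np.asarray(mem_ids) == mem_ids[0]))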

Answer 3 (score: 9):

I've written a small Python module that uses POSIX shared memory to share numpy arrays between Python interpreters. Maybe you will find it handy.

https://pypi.python.org/pypi/SharedArray

Here's how it works:

import numpy as np
import SharedArray as sa

# Create an array in shared memory
a = sa.create("test1", 10)

# Attach it as a different array. This can be done from another
# python interpreter as long as it runs on the same computer.
b = sa.attach("test1")

# See how they are actually sharing the same memory block
a[0] = 42
print(b[0])

# Destroying a does not affect b.
del a
print(b[0])

# See how "test1" is still present in shared memory even though we
# destroyed the array a.
sa.list()

# Now destroy the array "test1" from memory.
sa.delete("test1")

# The array b is not affected, but once you destroy it then the
# data are lost.
print(b[0])

Answer 4 (score: 8):

You can use the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem

Here's your original code, this time using shared memory that behaves like a NumPy array (note the additional final statement calling a NumPy sum() function):

from multiprocessing import Process
import sharedmem
import numpy as np

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = np.random.rand(N)
    arr = sharedmem.empty(N)
    arr[:] = unshared_arr.copy()
    print("Originally, the first two elements of arr = %s" % (arr[:2]))

    # Create, start, and finish the child process
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()

    # Print out the changed values
    print("Now, the first two elements of arr = %s" % arr[:2])

    # Perform some NumPy operation on the shared array
    print(arr.sum())