Question

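How can the application of a function to the elements of a NumPy array through numpy.apply_along_axis() be parallelized so as to take advantage of multiple cores? This seems like a natural thing to do in the common case where all the calls to the applied function are independent.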

In my particular case, if this matters, the axis of application is axis 0: np.apply_along_axis(func, axis=0, arr=param_grid) (np being NumPy).
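
For reference, here is a small sketch of what that call computes. It is only an illustration: the func below is a made-up stand-in (the real one is not shown in the question), and param_grid is filled with dummy values. Each 1D slice taken along axis 0 is handed to func, and the calls do not depend on one another, which is why the task looks easy to parallelize.

```python
import numpy as np


def func(column):
    # Hypothetical stand-in for the question's function:
    # takes a 1D array of parameters and returns a scalar.
    return column.max() - column.min()


# Dummy parameter grid; axis 0 holds the parameters passed to func.
param_grid = np.arange(24, dtype=float).reshape(2, 3, 4)

result = np.apply_along_axis(func, axis=0, arr=param_grid)

# The same computation written as an explicit loop over the other axes.
expected = np.empty(param_grid.shape[1:])
for index in np.ndindex(*expected.shape):
    expected[index] = func(param_grid[(slice(None),) + index])
```

The result has the shape of param_grid with axis 0 removed, here (3, 4).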

I had a quick look at Numba, but I can't seem to get this parallelization with a loop like the following:

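import numba
import numpy as np

@numba.jit(parallel=True)
def apply_func_along_axis_0(params):  # func is assumed to be defined elsewhere
    result = np.empty(shape=params.shape[1:])
    for index in np.ndindex(*result.shape):  # All the indices of params[0,...]
        result[index] = func(params[(slice(None),) + index])  # Applying func along axis 0
    return result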

There is also apparently a compilation option in NumPy for parallelization through OpenMP, but it does not seem to be accessible through MacPorts.

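One can also think of cutting the array into a few pieces, using threads (so as to avoid copying the data), and applying the function to each piece in parallel. This is more complex than what I am looking for (and might not work if the Global Interpreter Lock is not released enough).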
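
The thread-based idea could look roughly like the sketch below. This is not code from the question: the name threaded_apply_along_axis0 and the n_threads parameter are invented for illustration, and the sketch only covers axis 0 with an input of at least two dimensions and a func1d that returns a scalar. The array is split along axis 1, so every 1D slice along axis 0 stays whole, and np.array_split() hands out views, so the data is not copied. Whether this is any faster depends entirely on func1d releasing the GIL for most of its run time; with a pure-Python func1d, little or no gain should be expected.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def threaded_apply_along_axis0(func1d, arr, n_threads=None):
    """Apply func1d along axis 0 of arr, one thread per chunk (sketch)."""
    n_threads = n_threads or os.cpu_count() or 1
    # Split along axis 1; the chunks are views into arr (no copy).
    # Empty chunks are dropped because np.apply_along_axis() rejects them.
    chunks = [chunk for chunk in np.array_split(arr, n_threads, axis=1)
              if chunk.size]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        parts = list(executor.map(
            lambda chunk: np.apply_along_axis(func1d, 0, chunk), chunks))
    # Axis 0 of each partial result corresponds to axis 1 of arr.
    return np.concatenate(parts, axis=0)
```

A call such as threaded_apply_along_axis0(func, param_grid) should then return the same values as np.apply_along_axis(func, axis=0, arr=param_grid).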

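It would be very nice to be able to use multiple cores in a simple way for simple parallelizable tasks like applying a function to all the elements of an array (which is essentially what is needed here, with the small complication that function func takes a 1D array of parameters).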

Answer 1

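Alright, I worked it out: one idea is to use the standard multiprocessing module and split the original array into just a few chunks (so as to limit the communication overhead with the workers). This can be done relatively easily as follows: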

import multiprocessing

import numpy as np

def parallel_apply_along_axis(func1d, axis, arr, *args, **kwargs):
    """
    Like numpy.apply_along_axis(), but takes advantage of multiple
    cores.
    """        
    # Effective axis where apply_along_axis() will be applied by each
    # worker (any non-zero axis number would work, so as to allow the use
    # of `np.array_split()`, which is only done on axis 0):
    effective_axis = 1 if axis == 0 else axis
    if effective_axis != axis:
        arr = arr.swapaxes(axis, effective_axis)

    # Chunks for the mapping (only a few chunks); empty chunks are
    # skipped because np.apply_along_axis() raises on them:
    chunks = [(func1d, effective_axis, sub_arr, args, kwargs)
              for sub_arr in np.array_split(arr, multiprocessing.cpu_count())
              if sub_arr.size]

    pool = multiprocessing.Pool()
    individual_results = pool.map(unpacking_apply_along_axis, chunks)
    # Freeing the workers:
    pool.close()
    pool.join()

    return np.concatenate(individual_results)

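where the function unpacking_apply_along_axis() applied by Pool.map() is kept separate, as it should be (so that the subprocesses can import it), and is simply a thin wrapper that handles the fact that Pool.map() only accepts a function taking a single argument: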

def unpacking_apply_along_axis(all_args):
    """
    Like numpy.apply_along_axis(), but with arguments in a tuple
    instead.

    This function is useful with multiprocessing.Pool().map(): (1)
    map() only handles functions that take a single argument, and (2)
    this function can generally be imported from a module, as required
    by map().
    """
    # Tuple parameters in the signature are Python 2 only, so the
    # tuple is unpacked explicitly here:
    (func1d, axis, arr, args, kwargs) = all_args
    return np.apply_along_axis(func1d, axis, arr, *args, **kwargs)

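In my particular case, this resulted in a 2x speedup on 2 cores with hyper-threading. A factor closer to 4x would have been nicer, but the speedup is already good for just a few lines of code, and it should be better on machines with more cores (which are quite common). Maybe there is a way of avoiding data copies by using shared memory (perhaps through the multiprocessing module itself)?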

Easy parallelization of numpy.apply_along_axis()?
