Question

I am trying to use Pool.map(func, itr) to improve the performance of a program, and I need func to have access to a very large dictionary called cache so that it can do a cache lookup.

cache stores "the binary representation of each of the first 2**16 integers":

cache = {i: bin(i) for i in range(2**16 - 1)}

func's job is to count the number of 1s, or on-bits, in the binary representation of the int passed to it:

def func(i: int) -> int:
    return cache[i].count("1")

I would like to do something like the following:

with Pool(8) as pool:
    counts = pool.map(func, [i for i in range(2**16-1)])

But how do I make the cache object available to func in each worker subprocess?

Answer 1

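The naive solution

It is easy to "outsmart" yourself with the following recipe, which can be found in various places on the internet: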

import functools
from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16 - 1)}

def func(i: int, cache: Dict[int, str]) -> int:
    return cache[i].count("1")


if __name__ == "__main__":  # guard needed where workers are spawned (e.g. Windows, macOS)
    with Pool(8) as pool:
        # Bind 'cache' to 'func' and pass the partial to map()
        counts = pool.map(functools.partial(func, cache=cache),
                          [i for i in range(2**16-1)])

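This works... until you realize that it is actually slower than not parallelizing at all! The partial, and the large cache bound to it, is pickled and sent to a worker with every chunk of work, so you end up spending more on serializing and deserializing cache than you gain back from parallelization. For a more in-depth explanation, see Stuck in a Pickle.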
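To get a rough feel for the serialization overhead described above, the sketch below measures how large the pickled cache is and how long one serialize/deserialize round trip takes. It uses only the standard pickle and time modules; the helper name pickle_round_trip is made up for this illustration, and the timing will differ from machine to machine.

```python
import pickle
import time
from typing import Dict, Tuple

cache: Dict[int, str] = {i: bin(i) for i in range(2**16 - 1)}


def pickle_round_trip(obj: object) -> Tuple[int, float]:
    """Return (payload size in bytes, seconds for one dumps + loads)."""
    start = time.perf_counter()
    payload = pickle.dumps(obj)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    assert restored == obj  # sanity check: the copy equals the original
    return len(payload), elapsed


size, seconds = pickle_round_trip(cache)
print(f"pickled cache: {size} bytes, round trip: {seconds:.4f} s")
```

A partial that carries cache pays roughly this cost each time it is shipped to a worker, whereas func itself only does a dictionary lookup and a short string count per item.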

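The correct solution

The current "best practice" for getting data copied to Pool worker subprocesses is to make the variable global in one way or another. The pattern looks like this: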

from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16 - 1)}

def func(i: int) -> int:
    # 'global_cache' is set in each worker by make_global()
    return global_cache[i].count("1")


def make_global(cache: Dict[int, str]) -> None:
    # Declare 'global_cache' to be Global
    global global_cache
    # Update 'global_cache' with a value, now *implicitly* accessible in func
    global_cache = cache


if __name__ == "__main__":  # guard needed where workers are spawned (e.g. Windows, macOS)
    with Pool(8, initializer=make_global, initargs=(cache,)) as pool:
        counts = pool.map(func, [i for i in range(2**16-1)])

The same pattern can be applied to object-oriented code by using class attributes in place of global variables. We buy a bit more encapsulation this way.
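As a minimal sketch of that object-oriented variant: the class name BitCounter, its methods init_worker and count_ones, and the helper run are all hypothetical names chosen for this illustration. The class attribute plays the role of the global variable and is filled in once per worker by the Pool initializer.

```python
from multiprocessing import Pool
from typing import Dict, List


class BitCounter:
    # Class attribute standing in for the module-level global;
    # it is filled in separately inside each worker process.
    cache: Dict[int, str] = {}

    @classmethod
    def init_worker(cls, cache: Dict[int, str]) -> None:
        """Pool initializer: store the cache on the class, once per worker."""
        cls.cache = cache

    @classmethod
    def count_ones(cls, i: int) -> int:
        """Count the on-bits of i via the per-worker cache."""
        return cls.cache[i].count("1")


def run(n: int = 2**16 - 1, workers: int = 8) -> List[int]:
    """Build the cache in the parent and map count_ones over range(n)."""
    cache = {i: bin(i) for i in range(n)}
    with Pool(workers, initializer=BitCounter.init_worker,
              initargs=(cache,)) as pool:
        return pool.map(BitCounter.count_ones, range(n))
```

To use it as a script, call run() from under an if __name__ == "__main__": guard, as with the function-based version.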

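A note on the global keyword in the body of make_global():

The global keyword above makes global_cache a module-level (global) name even though the assignment happens inside a function, so it stays accessible from the global scope for the rest of the program. Note that nothing is "globalized" until make_global() actually runs, which happens in each worker subprocess as it starts up, so the variable lives in the global scope of the worker process rather than that of the parent.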
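The scoping behaviour can be checked in a single process, without a Pool. In this small sketch (the eight-entry cache and the helper is_defined are made up for illustration), global_cache does not exist at module level until make_global() has run, after which func can see it.

```python
from typing import Dict


def make_global(cache: Dict[int, str]) -> None:
    # 'global' makes the assignment below bind a module-level name,
    # not a local variable of make_global()
    global global_cache
    global_cache = cache


def func(i: int) -> int:
    # 'global_cache' is looked up in the module's global scope at call time
    return global_cache[i].count("1")


def is_defined() -> bool:
    """Report whether 'global_cache' currently exists at module level."""
    return "global_cache" in globals()


before = is_defined()  # make_global() has not run yet
make_global({i: bin(i) for i in range(8)})
after = is_defined()   # the name is now a module global
print(before, after, func(7))
```

In the Pool version the same thing happens inside each worker: make_global() runs there as the initializer, so every worker gets its own global_cache and the parent's global scope is left untouched.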

A (proposed) new solution

There is a third option, although it is buried deep, deep in a GitHub repository, on a fork of CPython.

This fork proposes a feature that would let you do the following:

from multiprocessing import Pool
from typing import Dict

cache = {i: bin(i) for i in range(2**16 - 1)}

# NOTE: relies on the proposed fork; stock CPython does not pass 'initret' to func
def func(i: int, initret: Dict[int, str]) -> int:
    cache = initret  # Re-assign var for illustrative/readability purposes
    return cache[i].count("1")


def identity(cache: Dict[int, str]) -> Dict[int, str]:
    return cache


if __name__ == "__main__":
    with Pool(8, initializer=identity, initargs=(cache,)) as pool:
        counts = pool.map(func, [i for i in range(2**16-1)])

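Although this is a small change, it avoids global variables and makes the flow of data between the parent process and the worker processes more readable. More on this here.

Essentially, the return value of initializer (identity() above) is passed to func (as a kwarg named initret) each time func is called in a worker process.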

Note: I am the author of all the blog posts linked above.
