Question

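I frequently find myself writing Python programs that construct a large (megabytes) read-only data structure and then use it to analyze a very large (hundreds of megabytes in total) list of small records. Each record can be analyzed in parallel, so the natural pattern is to build the read-only data structure, assign it to a global variable, create a multiprocessing.Pool (which implicitly copies the data structure into each worker process via fork), and then use imap_unordered to process the records in parallel. The skeleton of this pattern usually looks like this: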

import csv
import multiprocessing

classifier = None
def classify_row(row):
    return classifier.classify(row)

def classify(classifier_spec, data_file):
    global classifier
    try:
        classifier = Classifier(classifier_spec)
        with open(data_file, "rt") as fp, \
             multiprocessing.Pool() as pool:
            rd = csv.DictReader(fp)
            yield from pool.imap_unordered(classify_row, rd)
    finally:
        classifier = None

I'm not happy with this because of the global variable and the implicit coupling between classify and classify_row. Ideally, I would like to write

def classify(classifier_spec, data_file):
    classifier = Classifier(classifier_spec)
    with open(data_file, "rt") as fp, \
         multiprocessing.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

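But this does not work, because the Classifier object usually contains objects that cannot be pickled (they are defined by extension modules whose authors didn't care about that). I have also read that it would be really slow if it did work, because the Classifier object would be copied into the worker processes on every invocation of the bound method.

Is there a better alternative? I only care about 3.x.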
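
To make the failure concrete, here is a minimal sketch. DemoClassifier is a hypothetical stand-in that holds a threading.Lock in place of an unpicklable extension object. Pickling a bound method means pickling the instance it is bound to, which is what the pool has to do for each task it sends to a worker:

```python
import pickle
import threading


class DemoClassifier:
    """Hypothetical classifier; the lock stands in for an extension object."""

    def __init__(self, spec):
        self.spec = spec
        self.lock = threading.Lock()  # locks cannot be pickled

    def classify(self, row):
        return (self.spec, row)


def can_pickle_bound_method(obj):
    """Return True if obj.classify survives pickle.dumps, else False."""
    try:
        # Serializing the bound method serializes obj itself as well.
        pickle.dumps(obj.classify)
        return True
    except (TypeError, pickle.PicklingError):
        return False


picklable = can_pickle_bound_method(DemoClassifier("spec"))  # False
```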

Answer 1

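This is surprisingly tricky. The key is to preserve read access to variables that exist at fork time without serializing them. Most solutions for sharing memory in multiprocessing end up serializing. I tried using a weakref.proxy to pass in a classifier without serialization, but that didn't work because both dill and pickle try to follow and serialize the referent. A module reference, however, does work.

The following organization gets us close: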

import multiprocessing as mp
import csv


def classify(classifier, data_file):

    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)


def orchestrate(classifier_spec, data_file):
    # construct a classifier from the spec; note that we can
    # even dynamically import modules here, using config values
    # from the spec
    import classifier_module
    classifier_module.init(classifier_spec)
    return classify(classifier_module, data_file)


if __name__ == '__main__':
    list(orchestrate(None, 'data.txt'))

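A few changes to note here:

- We add an orchestrate function for some dependency-injection goodness; orchestrate works out how to construct and initialize a classifier and hands it to classify, decoupling the two.
- classify only needs to assume that the classifier parameter has a classify method; it does not care whether it is an instance or a module.

For this proof of concept, we provide a Classifier that is obviously not serializable: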

# classifier_module.py
def _create_classifier(spec):

    # obviously not pickle-able because it's inside a function
    class Classifier():

        def __init__(self, spec):
            pass

        def classify(self, x):
            print(x)
            return x

    return Classifier(spec)


def init(spec):
    global __classifier
    __classifier = _create_classifier(spec)


def classify(x):
    return __classifier.classify(x)

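Unfortunately, there is still a global here, but it is now nicely encapsulated inside the module as a private variable, and the module exports a tight interface consisting of the classify and init functions.

This design unlocks some possibilities:

- orchestrate can import and initialize different classifier modules, based on what it sees in classifier_spec.
- One could also pass an instance of some Classifier class to classify, as long as that instance is serializable and has a classify method with the same signature.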
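
The reason a module-level classify function can be handed to the pool is that pickle serializes module-level functions by reference (module name plus function name), not by value. The sketch below illustrates this, with json.dumps from the standard library standing in for classifier_module.classify. Nothing the function refers to, including module globals, ends up in the payload:

```python
import json
import pickle

# Pickling a module-level function records only where to find it.
payload = pickle.dumps(json.dumps)

# Unpickling imports the module and looks the name up again, so the
# result is the very same function object, not a copy.
restored = pickle.loads(payload)
same_object = restored is json.dumps  # True
```

In a forked worker, that lookup finds the already-initialized module, so the private global set by init is available without ever being serialized.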
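
As a rough sketch of the first possibility, orchestrate could pick the module by name with importlib. The "module" and "label" keys in the spec and the demo_classifier module are hypothetical, made up for illustration. The stand-in module is built in memory only so that the example is self-contained:

```python
import importlib
import sys
import types


def load_classifier(classifier_spec):
    """Import the module named in the spec, initialise it, and return it."""
    # The "module" key is a hypothetical convention for this sketch.
    module = importlib.import_module(classifier_spec["module"])
    module.init(classifier_spec)
    return module


# Stand-in for a real classifier module on disk, registered in
# sys.modules so that import_module can find it.
demo = types.ModuleType("demo_classifier")


def _init(spec):
    demo.label = spec["label"]


def _classify(row):
    return (demo.label, row)


demo.init = _init
demo.classify = _classify
sys.modules["demo_classifier"] = demo

classifier = load_classifier({"module": "demo_classifier", "label": "spam"})
result = classifier.classify("row-1")  # ("spam", "row-1")
```

A real classifier module would define init and classify at module level in its own file, as in classifier_module.py above, so that classify can be pickled by reference.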

Answer 2

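If you want to use forking, I don't see a way around using a global. But I also don't see why you should feel bad about using a global in this case; you are not, for example, manipulating a global list from multiple threads.

You can deal with the ugliness in your example, though. You want to pass classifier.classify directly, but the Classifier object contains objects that cannot be pickled.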

import os
import csv
import uuid
from threading import Lock
from multiprocessing import Pool
from weakref import WeakValueDictionary

class Classifier:

    def __init__(self, spec):
        self.lock = Lock()  # unpickleable
        self.spec = spec

    def classify(self, row):
        return f'classified by pid: {os.getpid()} with spec: {self.spec}', row

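I would suggest subclassing Classifier and defining __getstate__ and __setstate__ to enable pickling. Since you are using fork anyway, the only state it has to pickle is the information needed to get a reference to the forked global instance. We then update the unpickled object's __dict__ with the __dict__ of the forked instance (which has not gone through the reduction of pickling), and the instance is complete again.

To achieve this without additional boilerplate, the subclassed Classifier instance has to generate a name for itself and register it as a global variable. This first reference is a weak reference, so the instance can be garbage collected when the user expects it to be. The second reference is created by the user when assigning classifier = Classifier(classifier_spec). This one does not have to be global.

The name in the example below is generated with the help of the standard library's uuid module. A uuid is converted to a string and edited into a valid identifier (it would not have to be one, but that is convenient for debugging in interactive mode).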
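
The effect of making the registered reference weak can be seen in isolation. In this minimal sketch (Resource and registry are hypothetical names), the entry vanishes as soon as the last strong reference is dropped:

```python
import gc
import weakref


class Resource:
    """Placeholder for the classifier instance."""


registry = weakref.WeakValueDictionary()


def demo():
    obj = Resource()
    registry["key"] = obj           # weak reference only
    alive_before = "key" in registry
    del obj                         # drop the only strong reference
    gc.collect()                    # not needed on CPython, harmless elsewhere
    alive_after = "key" in registry
    return alive_before, alive_after


result = demo()  # (True, False)
```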

class SubClassifier(Classifier):

    def __init__(self, spec):
        super().__init__(spec)
        self.uuid = self._generate_uuid_string()
        self.pid = os.getpid()
        self._register_global()

    def __getstate__(self):
        """Define pickled content."""
        return {'uuid': self.uuid}

    def __setstate__(self, state):
        """Set state in child process."""
        self.__dict__ = state
        self.__dict__.update(self._get_instance().__dict__)

    def _get_instance(self):
        """Get reference to instance."""
        return globals()[self.uuid][self.uuid]

    @staticmethod
    def _generate_uuid_string():
        """Generate id as valid identifier."""
        # return 'uuid_' + '123' # for testing
        return 'uuid_' + str(uuid.uuid4()).replace('-', '_')

    def _register_global(self):
        """Register global reference to instance."""
        weakd = WeakValueDictionary({self.uuid: self})
        globals().update({self.uuid: weakd})

    def __del__(self):
        """Clean up globals when deleted in parent."""
        if os.getpid() == self.pid:
            # default avoids a KeyError if the entry is already gone
            globals().pop(self.uuid, None)

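The nice thing here is that the boilerplate is completely gone. You don't have to declare and delete globals by hand, because the instance manages everything itself in the background: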

def classify(classifier_spec, data_file, n_workers):
    classifier = SubClassifier(classifier_spec)
    # assert globals()['uuid_123']['uuid_123'] # for testing
    with open(data_file, "rt") as fh, Pool(n_workers) as pool:
        rd = csv.DictReader(fh)
        yield from pool.imap_unordered(classifier.classify, rd)


if __name__ == '__main__':

    PATHFILE = 'data.csv'
    N_WORKERS = 4

    g = classify(classifier_spec='spec1', data_file=PATHFILE, n_workers=N_WORKERS)
    for record in g:
        print(record)

    # assert 'uuid_123' not in globals() # no reference left

Answer 3

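The multiprocessing.sharedctypes module provides functions for allocating ctypes objects from shared memory. These objects can be inherited by child processes, so both parent and children can access the shared memory.

You can use:
1. multiprocessing.sharedctypes.RawArray to allocate a ctypes array from shared memory.
2. multiprocessing.sharedctypes.RawValue to allocate a ctypes object from shared memory.

Dr. Mianzhi Wang has written a very detailed document on this. You can share multiple multiprocessing.sharedctypes objects.

You may find the solution here useful.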
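
As a minimal sketch of both calls (the fill helper and the values are made up for illustration, and the fork start method is assumed), a child process can write into a RawArray and the parent sees the result without anything being returned or pickled:

```python
import multiprocessing as mp
from multiprocessing.sharedctypes import RawArray, RawValue


def fill(array, factor):
    # Runs in the child; writes land in memory the parent can also see.
    for i in range(len(array)):
        array[i] = i * factor.value


def run():
    ctx = mp.get_context("fork")   # assumes a platform that supports fork
    array = RawArray("i", 4)       # four C ints, zero-initialised
    factor = RawValue("i", 3)      # a single C int holding 3
    child = ctx.Process(target=fill, args=(array, factor))
    child.start()
    child.join()
    return list(array)


result = run()  # [0, 3, 6, 9]
```

RawArray and RawValue carry no lock, so this is only safe here because the parent reads after the child has been joined.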

Avoiding global variables for shared state across multiprocessing workers.
