Question

I start multiple processes to create a list of new objects. htop shows me between 1 and 4 processes (I always create 3 new objects).

def foo(self):
    with multiprocessing.Pool(processes=3, maxtasksperchild=10) as pool:
        result = pool.map_async(self.new_obj, self.information)
        self.new_objs = result.get()
        pool.terminate()
    gc.collect()

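I call foo() many times, and every call makes the whole thing run slower. In the end the program never finishes because it has slowed down so much. It also starts to eat all of my RAM, whereas the sequential approach uses no significant amount of RAM.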

When I kill the program, most of the time this is the function it was executing last:

->File "threading.py", line 293, in wait
    waiter.acquire()

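Edit: some more information about my situation. I am building a tree made of nodes. A parent node calls foo() to create its child nodes. The result returned by the processes is these child nodes, which are stored in a list on the parent node. I want to create the child nodes in parallel instead of sequentially.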

Answer 1

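I think your problem has mainly to do with the fact that your parallelised function is a method of the object. It is hard to be certain without more information, but consider this little toy program: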

import multiprocessing as mp
import numpy as np
import gc


class Object(object):
    def __init__(self, _):
        self.data = np.empty((100, 100, 100), dtype=np.float64)


class Container(object):
    def __new__(cls):
        self = object.__new__(cls)
        print("Born")
        return self

    def __init__(self):
        self.objects = []

    def foo(self):
        with mp.Pool(processes=3, maxtasksperchild=10) as pool:
            result = pool.map_async(self.new_obj, range(50))
            self.objects.extend(result.get())
            pool.terminate()
        gc.collect()

    def new_obj(self, i):
        return Object(i)

    def __del__(self):
        print("Dead")


if __name__ == '__main__':
    c = Container()
    for j in range(5):
        c.foo()

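Container is instantiated only once here, so you would expect to see a single "Born" followed by a single "Dead". But because the code executed by the processes is a method of the container, the whole container has to be shipped to the workers and rebuilt there! Running this, you will see an intermingled stream of "Born" and "Dead" as the container is rebuilt for every execution of the map: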

Born
Born
Born
Born
Born
Dead
Born
Dead
Dead
Born
Dead
Born
... 
<MANY MORE LINES HERE>
...
Born
Dead

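To convince yourself that the entire container is being copied and sent around every time, try setting an attribute that cannot be serialised: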

def foo(self):
    with mp.Pool(processes=3, maxtasksperchild=10) as pool:
        result = pool.map_async(self.new_obj, range(50))
        self.fn = lambda x: x**2
        self.objects.extend(result.get())
        pool.terminate()
    gc.collect()

This immediately raises an AttributeError, because the container can no longer be serialised.

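To sum up: when you send 1000 requests to the pool, Container gets serialised, sent to a worker process and deserialised there up to 1000 times (at least once for every chunk of inputs that map hands out). The copies are dropped eventually (assuming there is not too much odd cross-referencing), but this definitely puts a lot of pressure on RAM, because the object is serialised, called, updated and re-serialised over and over for the elements of your mapped input. And since the list stored on the container grows with every call to foo(), each copy is larger than the last, which fits the slowdown you are seeing.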
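The cost of shipping a bound method can be checked without starting a pool at all. The following is a minimal sketch, not part of the original program: the stripped-down Container, the module-level new_obj and the helper payload_size are hypothetical names used only for illustration. It measures how many bytes pickle needs for a bound method compared with a plain module-level function.

```python
import pickle


class Container:
    """Stripped-down stand-in for the Container class above (illustration only)."""

    def __init__(self):
        self.objects = []

    def new_obj(self, i):
        return i


def new_obj(i):
    """Module-level equivalent that carries no instance state."""
    return i


def payload_size(func):
    """Hypothetical helper: bytes pickle needs to ship `func` to a worker."""
    return len(pickle.dumps(func))


c = Container()
empty_size = payload_size(c.new_obj)

# Simulate results piling up on the container: 100 distinct 1000-byte blobs.
c.objects.extend(bytes(1000) for _ in range(100))
full_size = payload_size(c.new_obj)

# A module-level function is pickled by reference (module name + function name).
function_size = payload_size(new_obj)

print("bound method, empty container:", empty_size)
print("bound method, full container: ", full_size)
print("module-level function:        ", function_size)
```

The bound method drags the whole instance along, so its payload grows with everything stored on the container, while the module-level function stays a few dozen bytes no matter how large the container becomes.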

How do you solve this? Well, ideally, do not share state:

def new_obj(_):
    return Object(_)


class Container(object):
    def __new__(cls):
        self = object.__new__(cls)
        print("Born")
        return self

    def __init__(self):
        self.objects = []

    def foo(self):
        with mp.Pool(processes=3, maxtasksperchild=10) as pool:
            result = pool.map_async(new_obj, range(50))
            self.objects.extend(result.get())
            pool.terminate()
        gc.collect()

    def __del__(self):
        print("Dead")

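This finishes in a fraction of the time and produces only the tiniest blip in RAM usage (because only a single Container is ever built). If you need to pass along some of the internal state, extract it and send just that: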

def new_obj(tup):
    very_important_state, parameters = tup
    # Assumes Object.__init__ accepts these keyword arguments;
    # the toy Object defined earlier would need to be adapted.
    return Object(very_important_state=very_important_state,
                  parameters=parameters)


class Container(object):
    def __new__(cls):
        self = object.__new__(cls)
        print("Born")
        return self

    def __init__(self):
        self.objects = []

    def foo(self):
        # Extract only the state the workers need instead of sending self.
        important_state = len(self.objects)
        with mp.Pool(processes=3, maxtasksperchild=10) as pool:
            result = pool.map_async(new_obj,
                                    ((important_state, i) for i in range(50)))
            self.objects.extend(result.get())
            pool.terminate()
        gc.collect()

    def __del__(self):
        print("Dead")

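This behaves the same way as before. If you absolutely cannot avoid sharing some mutable state between the processes, use the tools that the multiprocessing module provides for this, so that not everything has to be copied every time.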
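As one possible illustration of such tools, the sketch below shares a single integer counter between the workers through multiprocessing.Value, handed to each worker once via the pool's initializer rather than with every task. The names _init_worker, _counter and build are hypothetical, and a shared counter is only an assumed stand-in for whatever mutable state actually needs sharing.

```python
import multiprocessing as mp

_counter = None  # set once per worker process by _init_worker


def _init_worker(shared_counter):
    """Runs once in each worker: keep a reference to the shared counter."""
    global _counter
    _counter = shared_counter


def new_obj(i):
    """Build a result and record in shared memory that one more was created."""
    with _counter.get_lock():  # guard the read-modify-write
        _counter.value += 1
    return i * i


def build(n):
    """Create n results in parallel; return them with the shared count."""
    counter = mp.Value("i", 0)  # a C int living in shared memory
    with mp.Pool(processes=3, initializer=_init_worker,
                 initargs=(counter,)) as pool:
        results = pool.map(new_obj, range(n))
    return results, counter.value


if __name__ == "__main__":
    results, created = build(50)
    print(len(results), "results,", created, "objects counted by workers")
```

The counter is passed to the workers only when they start, so each task still carries nothing but its small integer argument.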

Python3: multiprocessing consumes a lot of RAM and slows down
