Hello,

I stumbled upon a problem with ProcessPoolExecutor where processes can access data that they should not be able to. Let me explain:

My situation is similar to the example below: I have several runs to start, each with different parameters. They compute their results in parallel and have no reason to interact with each other. My understanding is that when a process forks, it duplicates itself: the child process has the same (memory) data as its parent, but any change it makes is made on its own copy. If I wanted changes to survive beyond the lifetime of the child process, I would have to use queues, pipes, or other IPC mechanisms.
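To illustrate what I mean, here is a minimal, Unix-only sketch of those fork semantics (not part of my actual code):

import os

# The child gets a copy of the parent's memory; its changes stay in
# that copy and never propagate back to the parent.
data = {"value": 0}
pid = os.fork()
if pid == 0:              # child process
    data["value"] = 42    # modifies only the child's copy
    os._exit(0)
os.waitpid(pid, 0)
print(data["value"])      # parent still prints 0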
But I actually don't need any of that! Each process handles its data on its own, and nothing should carry over to any other run. The example below shows otherwise, though: later runs (not the ones running in parallel) can access the data of earlier runs, implying that the data has not been cleared from the process.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import current_process, set_start_method


class Static:
    integer: int = 0


def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static.integer}", flush=True)
    # Check value
    if Static.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    # Update value
    Static.integer = run + 1


def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)


if __name__ == '__main__':
    set_start_method("fork")  # enforce fork
    pooling()
[1998 MainProcess] Start
[ 0 2020 Process-1] int: 0
[ 2 2020 Process-1] int: 1
[ 1 2021 Process-2] int: 0
[ 3 2021 Process-2] int: 2
run #0 finished
run #1 finished
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in <listcomp>
return [fn(*args) for args in chunk]
File "<stdin>", line 14, in inprocess
Exception: [ 2 2020 Process-1] Variable already set!
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 29, in <module>
File "<stdin>", line 24, in pooling
File "/usr/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
Exception: [ 2 2020 Process-1] Variable already set!
This behavior can also be reproduced with max_workers=1; the processes are re-used. The start method has no influence on the bug (although only "fork" appears to use more than one process).
So, to summarize: I want each process for a new run to have all the previous data, but none of the new data from any other run. Is that possible? How do I achieve it? And why doesn't the code above do exactly that?

Thanks for your help.
I have found multiprocessing.pool.Pool, which allows setting maxtasksperchild=1, so that a worker process is destroyed once its task is finished. But I dislike the multiprocessing interface; ProcessPoolExecutor is more comfortable to use. Besides, the whole idea of a pool is to save process-setup time, which is defeated if the hosting process is killed after every run.
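For completeness, here is a minimal sketch of that maxtasksperchild workaround, reusing inprocess() from the example above (the function name is mine; imap is used so each call counts as its own task against the limit):

from multiprocessing import Pool

def pooling_with_fresh_workers():
    # Every worker is destroyed after a single task (maxtasksperchild=1),
    # so no module- or class-level state can leak between runs.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        for i, _ in enumerate(pool.imap(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)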
Answer 0 (score: 3)
Brand-new processes in Python do not share memory state. However, ProcessPoolExecutor re-uses process instances; it is a pool of active processes, after all. I assume this is done to save the operating system the overhead of constantly tearing down and spinning up processes.

You see the same behavior in other distribution technologies such as Celery, where global state can bleed between executions if you are not careful.

I recommend managing your namespaces better to encapsulate your data. Using your example, you could encapsulate the code and data in a parent class that you instantiate inside inprocess(), instead of storing it in a shared namespace such as a static field of a class or directly in a module. That way the object is eventually cleaned up by the garbage collector:
class State:
    def __init__(self):
        self.integer: int = 0

    def do_stuff(self):  # needs self to be an instance method
        self.integer += 42


def use_global_function(state):
    state.integer -= 1664
    state.do_stuff()


def inprocess(run: int) -> None:
    cp = current_process()
    state = State()
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {state.integer}", flush=True)
    if state.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    state.integer = run + 1
    state.do_stuff()
    use_global_function(state)
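Since every call to inprocess() now creates its own State instance, worker re-use no longer matters: no state outlives a single task, and each run starts from integer == 0.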
Answer 1 (score: 0)
I ran into a potentially similar problem and saw some interesting posts in High Memory Usage Using Python Multiprocessing, which point towards using gc.collect(); in your case, however, that did not help. So I thought about how the Static class is initialized. A few points:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import current_process, set_start_method


class Static:
    integer: int = 0

    def __init__(self):
        pass


def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static().integer}", flush=True)
    # Check value
    if Static().integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    # Update value
    Static().integer = run + 1


def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)


if __name__ == "__main__":
    print("start")
    # set_start_method("fork")  # enforce fork; raises ValueError: cannot find context for 'fork' here
    set_start_method("spawn")  # alternative
    pooling()
This returns:
[ 0 1424 SpawnProcess-2] int: 0
[ 1 1424 SpawnProcess-2] int: 0
run #0 finished
[ 2 17956 SpawnProcess-1] int: 0
[ 3 1424 SpawnProcess-2] int: 0
run #1 finished
run #2 finished
run #3 finished
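(Note: the reads stay at 0 because Static().integer = run + 1 binds an instance attribute on a fresh, throwaway instance; the class attribute Static.integer is never modified, so each new Static() still sees 0 even in a re-used worker.)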