Question

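I have a Python script that concurrently processes numpy arrays and images in a random way. To have proper randomness inside the spawned processes, I pass a random seed from the main process to the workers so that they can seed themselves.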

When I use maxtasksperchild for the Pool, my script hangs after running Pool.map a number of times.

The following is a minimal snippet that reproduces the problem:

# This code stops after multiprocessing.Pool workers are replaced one single time.
# They are replaced due to maxtasksperchild parameter to Pool
from multiprocessing import Pool
import numpy as np

def worker(n):
    # Removing np.random.seed solves the issue
    np.random.seed(1) #any seed value
    return 1234 # trivial return value

# Removing maxtasksperchild solves the issue
ppool = Pool(20 , maxtasksperchild=5)
i=0
while True:
    i += 1
    # Removing np.random.randint(10) or taking it out of the loop solves the issue
    rand = np.random.randint(10)
    l  = [3] # trivial input to ppool.map
    result = ppool.map(worker, l)
    print i,result[0]

This is the output:

1 1234
2 1234
3 1234
.
.
.
99 1234
100 1234 # at this point workers should've reached maxtasksperchild tasks
101 1234
102 1234
103 1234
104 1234
105 1234
106 1234
107 1234
108 1234
109 1234
110 1234

and then it hangs indefinitely.

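I could replace numpy.random with Python's random module and the problem would go away. However, in my actual application the worker will execute user code (given as an argument to the worker) which I have no control over, and I would like to allow the use of numpy.random functions in that user code. So I intentionally want to seed the global random generator (for each process independently).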

This was tested with Python 2.7.10, numpy 1.11.0, 1.12.0 & 1.13.0, on Ubuntu and OSX.

Answer 1

It turns out this comes from a buggy interaction between Python's threading.Lock and multiprocessing.

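np.random.seed and most np.random.* functions use a threading.Lock to ensure thread safety. An np.random.* function generates a random number and then updates the seed (which is shared across threads), and that is why a lock is needed. See np.random.seed and cont0_array (used by np.random.random() and others).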

Now, how does this cause a problem in the snippet above?

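In a nutshell, the snippet hangs because the state of a threading.Lock is inherited when forking. So when a child is forked at the same moment that the lock is held in the parent (by np.random.randint(10)), the child deadlocks (at np.random.seed).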
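The inheritance of lock state across fork can be seen without numpy at all. The following is a minimal sketch, assuming a Unix system and Python 3; the module-level `lock` is a hypothetical stand-in for the lock that np.random uses internally, and the function name is made up for illustration.

```python
import os
import threading

# Hypothetical stand-in for the lock that np.random keeps internally.
lock = threading.Lock()


def child_inherits_held_lock():
    """Fork while another thread holds `lock`; report whether the child sees it locked."""
    held = threading.Event()
    done = threading.Event()

    def holder():
        with lock:
            held.set()   # tell the main thread the lock is now taken
            done.wait()  # keep holding it until the fork has happened

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()

    pid = os.fork()  # only the calling thread is copied into the child
    if pid == 0:
        # Child: the holder thread does not exist here, yet the lock is still
        # marked as taken. A blocking acquire() would wait forever, so probe
        # without blocking and report the result through the exit code.
        acquired = lock.acquire(False)
        os._exit(0 if acquired else 1)

    done.set()
    thread.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1
```

In the question's snippet the child does a blocking acquire inside np.random.seed instead of a non-blocking probe, which is why it never returns.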

@njsmith explains it in this github issue: https://github.com/numpy/numpy/issues/9248#issuecomment-308054786

multiprocessing.Pool spawns a background thread to manage workers: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L170-L173

It loops in the background calling _maintain_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L366

If a worker exits, for example due to a maxtasksperchild limit, then _maintain_pool calls _repopulate_pool: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L240

And then _repopulate_pool forks some new workers, still in this background thread: https://github.com/python/cpython/blob/aefa7ebf0ff0f73feee7ab24f4cdcb2014d83ee5/Lib/multiprocessing/pool.py#L224

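So what happens is that eventually you get unlucky: at the same moment that your main thread is calling some np.random function and holding the lock, multiprocessing decides to fork a child. The child starts out with the np.random lock already held, but the thread that was holding it is gone. Then the child tries to call into np.random, which requires taking the lock, and so the child deadlocks.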

The simple workaround here is to not use fork with multiprocessing. If you use the spawn or forkserver start methods, then this should go away.
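A minimal sketch of that workaround follows (it is not part of the quoted comment). It assumes Python 3.4 or later, where multiprocessing.get_context is available; the start methods do not exist in Python 2.7. The names make_pool and demo are hypothetical.

```python
import multiprocessing as mp


def worker(seed):
    # numpy is imported inside the worker only so that this sketch can be
    # imported on a machine without numpy; a top-level import works as well.
    import numpy as np
    np.random.seed(seed)  # seeds the global generator of this worker process
    return float(np.random.random_sample())


def make_pool(processes=4, start_method="spawn"):
    # A context bound to "spawn" (or "forkserver") does not fork the
    # multi-threaded parent, so no held lock is copied into the workers.
    ctx = mp.get_context(start_method)
    return ctx.Pool(processes, maxtasksperchild=5)


def demo(rounds=50):
    # Workers are replaced every 5 tasks, as in the question.
    with make_pool() as pool:
        for i in range(rounds):
            print(i, pool.map(worker, [i])[0])
```

With spawn the workers re-import the main module, so demo() should be called from under an `if __name__ == "__main__":` guard, and worker has to be defined at module level so that it can be pickled.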

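As for a proper fix... ugh. I guess we need to register a pthread_atfork pre-fork handler that takes the np.random lock before the fork and then releases it afterwards? And really I guess we need to do this for every lock in numpy, which requires something like keeping a weak set of every RandomState object, and _FFTCache also appears to have a lock...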

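(On the plus side, this would also give us an opportunity to reinitialize the global random state in the child, which we really should be doing in cases where the user hasn't explicitly seeded it.)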
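The idea of taking a lock before the fork and releasing it afterwards can be sketched at the Python level. This is only an illustration, not what numpy does: it assumes Python 3.7 or later on Unix, uses os.register_at_fork rather than pthread_atfork, and protects a hypothetical stand-in lock instead of the real np.random lock.

```python
import os
import threading
import time

# Hypothetical stand-in for a library-internal lock such as the one in np.random.
lock = threading.Lock()

# Take the lock right before every fork and release it on both sides afterwards,
# so the child never starts with the lock held by a thread that no longer exists.
os.register_at_fork(
    before=lock.acquire,
    after_in_parent=lock.release,
    after_in_child=lock.release,
)


def fork_while_lock_is_busy(attempts=20):
    """Fork repeatedly while a thread keeps using `lock`; True if no child found it held."""
    stop = threading.Event()

    def busy():
        while not stop.is_set():
            with lock:
                time.sleep(0.001)  # hold the lock for a moment
            time.sleep(0.001)      # leave a gap so the forking thread can take it

    thread = threading.Thread(target=busy)
    thread.start()
    all_free = True
    for _ in range(attempts):
        pid = os.fork()
        if pid == 0:
            # Child: probe the lock without blocking and report via the exit code.
            os._exit(0 if lock.acquire(False) else 1)
        _, status = os.waitpid(pid, 0)
        all_free = all_free and os.WEXITSTATUS(status) == 0
    stop.set()
    thread.join()
    return all_free
```

The handlers stay registered for the lifetime of the process, which is one reason a library would have to track every lock it creates, as the comment above points out.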

Answer 2

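Using numpy.random.seed is not thread-safe. numpy.random.seed changes the value of the seed globally, while - as far as I understand - you are trying to change the seed locally.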

See the docs.

If what you really want is to have the generator seeded at the start of each worker, the following is a solution:

import numpy as np

def worker(n):
    # A local generator replaces the global np.random.seed call
    randgen = np.random.RandomState(45678) # RandomState, not seed!
    # ...Do something with randgen...
    return 1234 # trivial return value

Answer 3

I am posting this as a full answer because it does not fit in a comment.

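After playing around a bit, something here smells like a numpy.random bug. I was able to reproduce the freeze, and in addition I saw some other weird things that shouldn't happen, such as manually seeding the generator having no effect.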

import multiprocessing
import numpy as np

def rand_seed(rand, i):
    print(i)
    np.random.seed(i)
    print(i)
    print(rand())  # rand is the function passed in from the parent

def test1():
    with multiprocessing.Pool() as pool:
        [pool.apply_async(rand_seed, (np.random.random_sample, i)).get()
         for i in range(5)]

test1()

which outputs:

0
0
0.3205032737431185
1
1
0.3205032737431185
2
2
0.3205032737431185
3
3
0.3205032737431185
4
4
0.3205032737431185

On the other hand, not passing np.random.random_sample as an argument works just fine.

def rand_seed2(i):
    print(i)
    np.random.seed(i)
    print(i)
    print(np.random.random_sample())

def test2():
    with multiprocessing.Pool() as pool:
        # call rand_seed2 here; rand_seed expects two arguments
        [pool.apply_async(rand_seed2, (i,)).get()
         for i in range(5)]

test2()

which outputs:

0
0
0.5488135039273248
1
1
0.417022004702574
2
2
0.43599490214200376
3
3
0.5507979025745755
4
4
0.9670298390136767

This suggests that something seriously funky is going on behind the curtain. I'm not sure what else to say about it, though...

Basically, it seems that numpy.random.seed modifies not only the "seed state" variable, but the random_sample function itself.

Why does this small snippet hang when using multiprocessing with maxtasksperchild, numpy.random.randint and numpy.random.seed?
