Multiprocessing initializer and pickling

Time: 2015-10-08 13:02:28

Tags: python pickle python-multiprocessing

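I've been playing with multiprocessing.Pool and trying to understand exactly how the initializer argument works. As I understand it, the initializer function is called once in each worker process, so I assumed its arguments (i.e. initargs) have to be pickled to cross the process boundary. I know the pool's map method also pickles its arguments, so I assumed that anything that works as an argument to the initializer should also work as an argument to map.

However, when I run the following code, initialize is called just fine, but map then throws an exception about not being able to pickle the module. (There's nothing special about using the current module as the argument; it's just the first non-picklable object that came to mind.) Does anyone know what might be behind this difference?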

from __future__ import print_function
import multiprocessing
import sys


def get_pid():
    return multiprocessing.current_process().pid


def initialize(module):
    print('Got module {} in PID {}'.format(module, get_pid()))


def worker(module):
    print('Got module {} in PID {}'.format(module, get_pid()))


current_module = sys.modules[__name__]
work = [current_module]

print('Main process has PID {}'.format(get_pid()))
pool = multiprocessing.Pool(None, initialize, work)
pool.map(worker, work)

1 answer:

Answer 0 (score: 1)

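The initialization doesn't require pickling, but the map call does. (When the workers are forked, as on the Unix setup used here, each child simply inherits initargs from the parent, whereas map has to send every task to an already-running worker.) Maybe this will help... (I'm using multiprocess instead of multiprocessing here because it provides better pickling and works interactively.)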

>>> from __future__ import print_function
>>> import multiprocess as multiprocessing
>>> import sys
>>> 
>>> def get_pid():
...     return multiprocessing.current_process().pid
... 
>>> 
>>> def initialize(module):
...     print('Got module {} in PID {}'.format(module, get_pid()))
... 
>>> 
>>> def worker(module):
...     print('Got module {} in PID {}'.format(module, get_pid()))
... 
>>> 
>>> current_module = sys.modules[__name__]
>>> work = [current_module]
>>> 
>>> print('Main process has PID {}'.format(get_pid()))
Main process has PID 34866
>>> pool = multiprocessing.dummy.Pool(None, initialize, work)
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
>>> pool.map(worker, work)
Got module <module '__main__' (built-in)> in PID 34866
[None]
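
The same behaviour can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3 and using multiprocessing.pool.ThreadPool in place of the multiprocessing.dummy.Pool call above; the helper name is_same_object is made up for illustration. It checks that the worker receives the very same module object, which is only possible because nothing is serialized.

```python
import sys
from multiprocessing.pool import ThreadPool


def is_same_object(module):
    # Threads share the parent's memory, so the argument arrives as the
    # very same object rather than as an unpickled copy.
    return module is sys


with ThreadPool(2) as pool:
    results = pool.map(is_same_object, [sys])

print(results)  # [True]
```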

Cool. The thread pool works... (because it doesn't need to pickle anything). What about when serialization is used to send both worker and work?

>>> pool = multiprocessing.Pool(None, initialize, work)
Got module <module '__main__' (built-in)> in PID 34875
Got module <module '__main__' (built-in)> in PID 34876
Got module <module '__main__' (built-in)> in PID 34877
Got module <module '__main__' (built-in)> in PID 34878
Got module <module '__main__' (built-in)> in PID 34879
Got module <module '__main__' (built-in)> in PID 34880
Got module <module '__main__' (built-in)> in PID 34881
Got module <module '__main__' (built-in)> in PID 34882
>>> pool.map(worker, work)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 567, in get
    raise self._value
NotImplementedError: pool objects cannot be passed between processes or pickled
>>> 

Let's look at pickling work:

>>> import pickle
>>> import sys            
>>> pickle.dumps(sys.modules[__name__])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects
>>> 
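
For readers on Python 3, the same check can be written as a small helper. This is only a sketch: try_pickle is a hypothetical name, and the exact wording of the error message differs between Python versions, so the code inspects only the exception type.

```python
import pickle
import sys


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the exception raised."""
    try:
        pickle.dumps(obj)
    except Exception as exc:
        return exc
    return None


# A plain list pickles without trouble.
print(try_pickle([1, 2, 3]))

# A module object is rejected by the standard pickle module.
print(type(try_pickle(sys)).__name__)
```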

So, you can't pickle a module... OK, can we do better with dill?

>>> import dill
>>> dill.detect.trace(True)
>>> dill.pickles(work)
M1: <module '__main__' (built-in)>
F2: <function _import_module at 0x10c017cf8>
# F2
D2: <dict object at 0x10d9a8168>
M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/__init__.pyc'>
# M2
F1: <function worker at 0x10c07fed8>
F2: <function _create_function at 0x10c017488>
# F2
Co: <code object worker at 0x10b053cb0, file "<stdin>", line 1>
F2: <function _unmarshal at 0x10c017320>
# F2
# Co
D1: <dict object at 0x10af68168>
# D1
D2: <dict object at 0x10c0e4a28>
# D2
# F1
M2: <module 'sys' (built-in)>
# M2
F1: <function initialize at 0x10c07fe60>
Co: <code object initialize at 0x10b241f30, file "<stdin>", line 1>
# Co
D1: <dict object at 0x10af68168>
# D1
D2: <dict object at 0x10c0ea398>
# D2
# F1
M2: <module 'pathos' from '/Users/mmckerns/lib/python2.7/site-packages/pathos-0.2a1.dev0-py2.7.egg/pathos/__init__.pyc'>
# M2
C2: __future__._Feature
# C2
D2: <dict object at 0x10b05b7f8>
# D2
M2: <module 'multiprocess' from '/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/__init__.pyc'>
# M2
T4: <class 'pathos.threading.ThreadPool'>
# T4
D2: <dict object at 0x10c0ea5c8>
# D2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1209, in pickles
    pik = copy(obj, **kwds)
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 161, in copy
    return loads(dumps(obj, *args, **kwds))
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 197, in dumps
    dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 190, in dump
    pik.dump(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1116, in save_module
    state=_main_dict)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 768, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 452, in __reduce__
    'pool objects cannot be passed between processes or pickled'
NotImplementedError: pool objects cannot be passed between processes or pickled
>>> 

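So the answer is yes: the module starts to pickle, but it fails because of what the module contains... It looks like it works for everything in __main__ unless there is an instance of a pool in __main__, and then it fails.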
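
The effect of a single pool instance can be reproduced without dill. The sketch below is an illustration under assumptions, not the answer's own code: a plain dict stands in for the module namespace, and a standard-library ThreadPool stands in for the pool stored in __main__ (it inherits the same refusal to be pickled).

```python
import pickle
from multiprocessing.pool import ThreadPool

pool = ThreadPool(1)

# Stand-in for a module namespace that happens to hold a pool instance.
namespace = {"answer": 42, "pool": pool}

try:
    pickle.dumps(namespace)
    error = None
except NotImplementedError as exc:
    # Raised by the pool's __reduce__ while the dict's values are saved.
    error = exc
finally:
    pool.close()
    pool.join()

print(error)

# Without the pool entry, the same namespace round-trips fine.
clean = {key: value for key, value in namespace.items() if key != "pool"}
restored = pickle.loads(pickle.dumps(clean))
print(restored)
```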

So, if the last two lines of your code were replaced with this single line, it would work, since the pool is then never stored under a name in __main__:

>>> multiprocessing.Pool(None, initialize, work).map(worker, work)
Got module <module '__main__' (built-in)> in PID 34922
Got module <module '__main__' (built-in)> in PID 34923
Got module <module '__main__' (built-in)> in PID 34924
Got module <module '__main__' (built-in)> in PID 34925
Got module <module '__main__' (built-in)> in PID 34926
Got module <module '__main__' (built-in)> in PID 34927
Got module <module '__main__' (built-in)> in PID 34928
Got module <module '__main__' (built-in)> in PID 34929
Got module <module '__main__' (built-in)> in PID 34922
[None]
>>> 

This is using multiprocess, because it uses dill under the covers. pickle will still fail here, because pickle can't serialize a module. Serialization is needed because the objects have to be sent to another Python instance running in another process.
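
If switching to multiprocess is not an option, one possible workaround (not part of the answer above, so treat it as an assumption) is to send the module's name, which is a plain string, and look the module up again on the receiving side. The sketch below simulates the trip across the process boundary with a pickle round trip instead of starting a real pool; worker_by_name is a hypothetical replacement for the question's worker.

```python
import importlib
import pickle


def worker_by_name(module_name):
    # Hypothetical worker: it receives a name instead of a module object
    # and re-imports the module locally.
    module = importlib.import_module(module_name)
    return module.__name__


# Simulate sending the argument to another process: a string pickles fine.
payload = pickle.dumps("json")
received = pickle.loads(payload)

result = worker_by_name(received)
print(result)  # json
```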