Safely write to file in parallel with pathos.multiprocessing

Date: 2015-09-14 15:26:40

Tags: python python-multiprocessing pathos

pathos.multiprocessing is known to have an advantage over Python's standard multiprocessing library: it uses dill instead of pickle and can therefore serialize a much wider range of functions and other objects.

But when it comes to writing pool.map() results to a file line by line with pathos, some trouble comes up. If all processes in the pool write their results line-wise into a single file, they interfere with each other, writing some lines simultaneously and spoiling the output. With the ordinary multiprocessing package I was able to make each process write to its own separate file, named with the current process id, like this:

import gzip

example_data = range(100)
def process_point(point):
    # each worker appends to its own gzip file, named after its pid
    output = "output-%d.gz" % mpp.current_process().pid
    with gzip.open(output, "a+") as fout:
        fout.write('%d\n' % point**2)

Then, this code works well:

import multiprocessing as mpp
pool = mpp.Pool(8)
pool.map(process_point, example_data)

But this code doesn't:

from pathos import multiprocessing as mpp
pool = mpp.Pool(8)
pool.map(process_point, example_data)

and throws an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-a6fb174ec9a5> in <module>()
----> 1 pool.map(process_point, example_data)

/usr/local/lib/python2.7/dist-packages/processing-0.52_pathos-py2.7-linux-x86_64.egg/processing/pool.pyc in map(self, func, iterable, chunksize)
    128         '''
    129         assert self._state == RUN
--> 130         return self.mapAsync(func, iterable, chunksize).get()
    131
    132     def imap(self, func, iterable, chunksize=1):

/usr/local/lib/python2.7/dist-packages/processing-0.52_pathos-py2.7-linux-x86_64.egg/processing/pool.pyc in get(self, timeout)
    371             return self._value
    372         else:
--> 373             raise self._value
    374
    375     def _set(self, i, obj):

AttributeError: 'module' object has no attribute 'current_process'

There is no current_process() in pathos, and I cannot find anything similar to it. Any ideas?

2 answers:

Answer 0 (score: 2):

This simple trick seems to do the job:

import multiprocessing as mp
from pathos import multiprocessing as pathos_mp
import gzip

example_data = range(100)
def process_point(point):
    # the stdlib multiprocessing is used only to get the worker's pid;
    # the pool itself is run by pathos
    output = "output-%d.gz" % mp.current_process().pid
    with gzip.open(output, "a+") as fout:
        fout.write('%d\n' % point**2)

pool = pathos_mp.Pool(8)
pool.map(process_point, example_data)

To put it differently: one can use pathos for the parallel computation and the ordinary multiprocessing package just to get the id of the current process, and this works correctly!

Answer 1 (score: 2):

I'm the pathos author. While your answer works for this case, it's probably better to use the fork of multiprocessing that ships inside pathos, which lives at the rather obtuse location pathos.helpers.mp.

This gives you a one-to-one mapping with multiprocessing, but with better serialization. So instead of importing the standard library module, you would use pathos.helpers.mp.current_process.

Sorry it's neither documented nor obvious... I should fix at least one of those two issues.
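For what it's worth, here is a minimal sketch of the question's example rewritten along those lines (assuming pathos.helpers.mp is importable exactly as the author describes):

import gzip
from pathos.helpers import mp              # pathos' own fork of multiprocessing
from pathos import multiprocessing as pathos_mp

example_data = range(100)
def process_point(point):
    # current_process() now comes from the fork bundled with pathos,
    # so the stdlib multiprocessing module is not needed at all
    output = "output-%d.gz" % mp.current_process().pid
    with gzip.open(output, "a+") as fout:
        fout.write('%d\n' % point**2)

pool = pathos_mp.Pool(8)
pool.map(process_point, example_data)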