我有一个80,000个字符串的列表,我正在通过话语解析器运行,为了提高这个过程的速度,我一直在尝试使用python多处理包。
解析器代码需要python 2.7,我目前正在使用字符串子集在2核Ubuntu机器上运行它。对于短列表,即20,该过程在两个核心上都没有问题,但是如果我运行大约100个字符串的列表,则两个工作人员将在不同点冻结(因此在某些情况下,工作人员1不会停止直到工人后几分钟2)。这在所有字符串完成并返回任何内容之前发生。在每次核心停止在同一点时给定相同的映射函数,但是如果我尝试不同的映射函数,即map vs map_async vs imap,这些点是不同的。
我已经尝试删除那些没有任何影响的索引的字符串,并且这些字符串在较短的列表中运行良好。基于我包含的打印语句,当进程看起来冻结时,当前的迭代似乎完成当前字符串,并且它不会移动到下一个字符串。大约需要一个小时的运行时间才能到达两个工人都已冻结的地方,而我却无法在更短的时间内重现问题。涉及多处理命令的代码是:
def main(initial_file, chunksize = 2):
entered_file = pd.read_csv(initial_file)
entered_file = entered_file.ix[:, 0].tolist()
pool = multiprocessing.Pool()
result = pool.map_async(discourse_process, entered_file, chunksize = chunksize)
pool.close()
pool.join()
with open("final_results.csv", 'w') as file:
writer = csv.writer(file)
for listitem in result.get():
writer.writerow([listitem[0], listitem[1]])
if __name__ == '__main__':
main(sys.argv[1])
当我使用Ctrl-C(但并不总是有效)停止进程时,我收到的错误消息是:
^CTraceback (most recent call last):
File "Combined_Script.py", line 94, in <module>
main(sys.argv[1])
File "Combined_Script.py", line 85, in main
pool.join()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 474, in join
p.join()
File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join
res = self._popen.wait(timeout)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
return self.poll(0)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Process PoolWorker-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 117, in worker
put((job, i, result))
File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
wacquire()
KeyboardInterrupt
^CProcess PoolWorker-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 117, in worker
put((job, i, result))
File "/usr/lib/python2.7/multiprocessing/queues.py", line 392, in put
return send(obj)
KeyboardInterrupt
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/lib/python2.7/multiprocessing/util.py", line 305, in _exit_function
_run_finalizers(0)
File "/usr/lib/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
finalizer()
File "/usr/lib/python2.7/multiprocessing/util.py", line 207, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 500, in _terminate_pool
outqueue.put(None) # sentinel
File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
wacquire()
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/lib/python2.7/multiprocessing/util.py", line 305, in _exit_function
_run_finalizers(0)
File "/usr/lib/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
finalizer()
File "/usr/lib/python2.7/multiprocessing/util.py", line 207, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 500, in _terminate_pool
outqueue.put(None) # sentinel
File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
wacquire()
KeyboardInterrupt
当我使用htop查看另一个命令窗口中的内存时,一旦工作人员冻结,内存为<3%。这是我第一次尝试并行处理,我不确定还有什么可能会丢失?
答案 0 :(得分:1)
您可以为流程定义一个返回结果的时间,否则会引发错误:
try:
result.get(timeout = 1)
except multiprocessing.TimeoutError:
print("Error while retrieving the result")
您还可以验证您的过程是否成功
import time
while True:
try:
result.succesful()
except Exception:
print("Result is not yet succesful")
time.sleep(1)
最后,签出https://docs.python.org/2/library/multiprocessing.html很有帮助。
答案 1 :(得分:0)
我无法解决多处理池的问题,但是我遇到了loky包并且能够使用它来运行我的代码,其中包含以下几行:
executor = loky.get_reusable_executor(timeout = 200, kill_workers = True)
results = executor.map(discourse_process, entered_file)