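Let me first say that this is not a duplicate of the other similar questions, in which people tend to manage the workers more closely.
While using multiprocessing.Pool.imap, I have been struggling with the following exception thrown by my code: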
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/pool.py", line 122, in worker
    put((job, i, (False, wrapped)))
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/queues.py", line 390, in put
    return send(obj)
IOError: [Errno 32] Broken pipe
This happens quite often when executing the following main program:
import multiprocessing as mp
from functools import partial

import pandas as pd

pool = mp.Pool(num_workers)
# Calculate a good chunksize (based on implementation of pool.map)
chunksize, extra = divmod(lengthData, 4 * num_workers)
if extra:
    chunksize += 1
# Bind every argument except the PDF path, which imap supplies per item
func = partial(pdf_to_txt, input_folder=inputFolder, junk_folder=imageJunkFolder, out_folder=outTextFolder,
               log_name=log_name, log_folder=None,
               empty_log=False, input_folder_iterator=None,
               print_console=True)
# imap returns a lazy iterator that yields results in input order
flag_vec = pool.imap(func, (dataFrame['testo accordo'][i] for i in range(lengthData)), chunksize)
dataFrame['flags_conversion'] = pd.Series(flag_vec)
dataFrame.to_excel("{0}logs/{1}.xlsx".format(outTextFolder, nameOut))
pool.close()
pool.join()
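FYI, the partial function takes a non-OCR PDF file, splits it into one image per page, and then runs OCR on the images with pytesseract.
I am running the code on the following machine: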
This is a physical machine (PowerEdge R930) running RedHat 7.7 (Linux 3.10.0).
Processor: Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz (x144)
Memory: 1.48 TiB
Swap: 7.81 GiB
Uptime: 21 days
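Maybe I should reduce the chunksize? I really have no idea. I have noticed that the code seems to work better when fewer workers are used on the server...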
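For reference, the chunksize heuristic used in the program above can be isolated into a small function to see what values it actually produces. This is only a sketch: the function name pool_map_chunksize is made up for illustration, and the item counts are arbitrary examples rather than figures from the question. Note that imap defaults to chunksize=1 when no value is passed, and that with a larger chunksize a worker hands back a whole chunk at once, so one slow or stuck item holds up the rest of its chunk.

```python
def pool_map_chunksize(n_items, num_workers):
    """Chunksize heuristic from the question: about 4 chunks per worker, rounded up."""
    # pool_map_chunksize is a hypothetical name; the arithmetic is the same as in the program above.
    chunksize, extra = divmod(n_items, 4 * num_workers)
    if extra:
        chunksize += 1
    return chunksize


if __name__ == "__main__":
    # Arbitrary example sizes, only to show how the heuristic behaves.
    for n_items, num_workers in [(1000, 144), (10000, 8), (64, 4)]:
        print(n_items, num_workers, pool_map_chunksize(n_items, num_workers))
```

Passing chunksize=1 to imap (or leaving the argument out) would dispatch each PDF as its own task, which may make it easier to see which input a worker is stuck on, at the cost of more inter-process traffic.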
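Answer 0 (score: 0)
After a lot of pain, I found that the problem was with pdftoppm (i.e. with using pdf2image). It seems that pdftoppm sometimes gets stuck without raising any exception.
If anyone runs into this problem, I warmly suggest switching to PyMuPDF to extract the images from the PDFs. It is faster and more stable!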
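One general way to keep a silently hanging external command from blocking a worker forever is to run it with a time limit. The sketch below is not what I ended up doing (I switched libraries instead), and it assumes Python 3, whereas the traceback above comes from Python 2.7. The helper name run_with_timeout is made up, and the child process that just sleeps is only a stand-in for a stuck converter.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its CompletedProcess, or None if it exceeds timeout_s seconds."""
    # run_with_timeout is a hypothetical helper name, not part of any library.
    try:
        return subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no stuck process is left behind.
        return None


if __name__ == "__main__":
    # Stand-in for a stuck converter: a child process that sleeps far longer than the limit.
    # A real call might look like ["pdftoppm", "-png", "in.pdf", "out_prefix"] (file names hypothetical).
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    result = run_with_timeout(stuck, timeout_s=1)
    print("timed out" if result is None else "finished")
```

A worker that gets None back can return a failure flag for that PDF instead of hanging, so the pool keeps making progress.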
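A minimal sketch of what rendering each page to a PNG with PyMuPDF could look like is given below. It is an illustration under assumptions, not my exact code: the function names, the output file naming scheme and the 300 dpi default are all made up, and it assumes a PyMuPDF version that provides the snake_case get_pixmap method.

```python
import os


def page_image_path(out_dir, pdf_path, page_index):
    """Build an output path such as <out_dir>/<pdf name>_p0001.png (hypothetical naming scheme)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, "{0}_p{1:04d}.png".format(stem, page_index + 1))


def render_pdf_pages(pdf_path, out_dir, dpi=300):
    """Render every page of pdf_path to a PNG in out_dir and return the written paths."""
    import fitz  # PyMuPDF; imported here so page_image_path stays usable without it

    zoom = dpi / 72.0  # PDF user space is 72 units per inch
    paths = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            # Scale the page so the resulting image is at roughly the requested dpi.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            path = page_image_path(out_dir, pdf_path, page.number)
            pix.save(path)
            paths.append(path)
    finally:
        doc.close()
    return paths
```

The returned paths can then be passed one by one to pytesseract, in the same way as the images that pdf2image produced before.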