I'm doing some preprocessing on a bunch of text files, and when I wrote an imap version of it, the end result was actually slower than the plain sequential run.
import os
import json
import time
import multiprocessing
from multiprocessing import Pool

def process_text(document):
    # do some more preprocessing on the text
    # extract named entities
    return named_entities

def input_files(dir, fname):
    # read each document as one big string
    document = open(os.path.join(dir, fname)).read()
    # tokenise here
    yield tokenize(document)

def corpus_preprocessing(top_dir):
    # save to another directory; create it if it does not exist
    if not os.path.exists(top_dir + '_pre'):
        os.makedirs(top_dir + '_pre')
    # real multiprocessing starts here
    for fname in os.listdir(top_dir):
        for named_entities in pool.imap(process_text, input_files(top_dir, fname)):
            with open(os.path.join(top_dir + '_pre', fname), 'w') as handle:
                json.dump(named_entities, handle)
    pool.terminate()

# initialize pool
pool = Pool(multiprocessing.cpu_count())

# let's time it
now = time.time()

# provide the path to the dataset here
top_dir = '/home/dataset/sample'
corpus_preprocessing(top_dir)

# print total time taken
print "Finished in", time.time() - now, "sec"
I also enabled logging:
with imap:
[WARNING/MainProcess] doomed
[WARNING/MainProcess] doomed
[WARNING/MainProcess] doomed
[WARNING/MainProcess] doomed
[WARNING/MainProcess] doomed
[WARNING/MainProcess] doomed
Finished in 29.0439419746 sec
with sequential, one-file-at-a-time execution:
Finished in 18.4209680557 sec
Does the "doomed" message mean that all the worker processes died quickly, leaving a single process to do all the work? Any suggestions on what I'm doing wrong with multiprocessing here?
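For reference, here is a minimal, runnable sketch of the pattern I expected to be faster: a single imap call over a generator of the whole corpus, so the pool's start-up cost is paid once rather than once per file. `process_text` and `all_documents` below are trivial stand-ins, not my real tokenisation/extraction code:

```python
from multiprocessing import Pool

def process_text(document):
    # stand-in for the real named-entity extraction step
    return document.upper()

def all_documents(fnames):
    # one generator over the whole corpus, so a single imap call
    # can keep every worker busy, instead of one imap call per file
    for fname in fnames:
        # stand-in for tokenize(open(os.path.join(dir, fname)).read())
        yield fname

def run_demo():
    pool = Pool(2)
    try:
        # a single imap over all documents; results stream back in input order
        return list(pool.imap(process_text, all_documents(['a', 'b', 'c'])))
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(run_demo())  # -> ['A', 'B', 'C']
```

With this shape the pool sees the corpus as one stream, whereas my original code hands each imap call a one-item generator.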