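I need to process a text word by word. Because the sequential program I wrote is very slow, I tried to rewrite it with the multiprocessing library. I found that the multiprocessing version is much slower than the sequential one. Am I missing something in the code when using Pool? The do_something function runs many for loops and if statements.
Sequential code: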
class Text():
    def do_something(self, word):
        ....
        # Computational heavy code
        ....
        return new_word

....
new_text = []
for sentence in text:
    new_sentence = []
    for word in sentence:
        ....
        new_word = Text().do_something(word)
        new_sentence += new_word
    new_text.append(new_sentence)
print(new_text)
Multiprocessing code:
from multiprocessing import Pool, cpu_count

class Text():
    def do_something(self, word):
        ....
        # Computational heavy code
        ....
        return new_word

    def do_word(self, word):
        ....
        if len(word) > 2:
            return self.do_something(word).split('$')
        else:
            return ['NONE']

    def do_text(self, text):
        new_text = []
        pool = Pool(processes = cpu_count())
        for sentence in text:
            new_text.append( [item for sublist in pool.map(self.do_word, sentence.split()) for item in sublist if item != 'NONE'] )
        return new_text

if __name__ == "__main__":
    ....
    print(Text().do_text(file))
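For comparison, one variation that may be worth timing against the per-sentence version is to call pool.map once over all the words, so each worker receives larger chunks and the inter-process overhead is paid less often. This is only a minimal sketch under assumptions: do_word here is a hypothetical module-level stand-in (it just upper-cases the word) for the elided heavy computation, and regroup and the chunksize heuristic are illustrative choices, not part of my real code.

```python
from multiprocessing import Pool, cpu_count


def do_word(word):
    # Hypothetical stand-in for the elided heavy computation.
    if len(word) > 2:
        return word.upper().split('$')
    return ['NONE']


def regroup(results, lengths):
    # Rebuild the per-sentence structure from the flat list of results.
    new_text, position = [], 0
    for length in lengths:
        group = results[position:position + length]
        position += length
        new_text.append([item for sublist in group
                         for item in sublist if item != 'NONE'])
    return new_text


def do_text(text, processes=None):
    sentences = [sentence.split() for sentence in text]
    words = [word for sentence in sentences for word in sentence]
    lengths = [len(sentence) for sentence in sentences]
    if not words:
        return [[] for _ in sentences]
    processes = processes or cpu_count()
    # Assumed heuristic: a few chunks per worker to balance the load.
    chunksize = max(1, len(words) // (processes * 4))
    # One pool and one map call for the whole text.
    with Pool(processes=processes) as pool:
        results = pool.map(do_word, words, chunksize=chunksize)
    return regroup(results, lengths)


if __name__ == "__main__":
    print(do_text(["ab hello wor$ld", "a big$one"]))
    # prints [['HELLO', 'WOR', 'LD'], ['BIG', 'ONE']]
```

Whether this is actually faster depends on how expensive do_something is per word compared with the cost of pickling the arguments and results between processes.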
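Following Panagiotis Kanavos's suggestion, I tried multithreading instead of multiprocessing. However, when I run the code below, the machine seems to use only one core (CPU usage is about 25%, and I have a 4-core CPU). The speed appears to be the same as with the sequential code (which also shows 25% CPU usage).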
from multiprocessing import cpu_count
from multiprocessing.dummy import Pool as ThreadPool

class Text():
    def do_something(self, word):
        ....
        # Computational heavy code
        ....
        return new_word

    def do_word(self, word):
        ....
        if len(word) > 2:
            return self.do_something(word).split('$')
        else:
            return ['NONE']

    def do_text(self, text):
        new_text = []
        pool = ThreadPool(processes = cpu_count())
        for sentence in text:
            new_text.append( [item for sublist in pool.map(self.do_word, sentence.split()) for item in sublist if item != 'NONE'] )
        return new_text

if __name__ == "__main__":
    ....
    print(Text().do_text(file))
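The 25% CPU usage would be consistent with how threads behave in standard CPython builds, where the global interpreter lock lets only one thread execute Python bytecode at a time, so a thread pool usually does not speed up pure-Python CPU-bound work. Below is a small sketch for checking this on a given machine. cpu_bound is a hypothetical stand-in for do_something, and the task sizes are arbitrary assumptions.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def cpu_bound(n):
    # Hypothetical stand-in for do_something: a pure-Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(tasks):
    return [cpu_bound(n) for n in tasks]


def run_threaded(tasks, threads=4):
    # Same work, but dispatched to a pool of threads.
    with ThreadPool(threads) as pool:
        return pool.map(cpu_bound, tasks)


def timed(label, func, tasks):
    start = time.perf_counter()
    result = func(tasks)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


if __name__ == "__main__":
    tasks = [200_000] * 16  # arbitrary workload
    sequential = timed("sequential", run_sequential, tasks)
    threaded = timed("threaded", run_threaded, tasks)
    print("same results:", sequential == threaded)
```

If the two timings come out roughly equal, that suggests the work is CPU-bound in Python code and threads alone will not spread it across cores.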