I need to read a large (10GB+) file line by line and process each line. The processing is fairly simple, so multiprocessing seemed like the best option. However, when I set it up, it runs much slower than the single-process version. My CPU usage never goes above 50%, so it isn't a processing-power problem.
I'm running Python 3.6 in a Jupyter Notebook on a Mac.
This is what I have, based on the answer posted here:
```python
from multiprocessing import Manager, Process

def do_work(in_queue, out_list):
    while True:
        line = in_queue.get()
        # exit signal
        if line is None:
            return
        # fake work for testing
        elements = line.split("\t")
        out_list.append(elements)

if __name__ == "__main__":
    num_workers = 4

    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    # start the workers
    pool = []
    for i in range(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)

    # produce data
    with open(file_on_my_machine, 'rt', newline="\n") as f:
        for line in f:
            work.put(line)

    # send one exit signal per worker so join() can return
    for i in range(num_workers):
        work.put(None)

    for p in pool:
        p.join()

    # get the results
    print(sorted(results))
```
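For reference, the single-process version I'm comparing against is essentially just this (a minimal sketch; `process_lines` is a hypothetical name, and it does the same fake tab-split work as the worker above):

```python
def process_lines(lines):
    # same fake work as do_work, but in one process with no queue
    results = []
    for line in lines:
        results.append(line.split("\t"))
    return results

# usage with a real file would look like:
# with open(file_on_my_machine, 'rt', newline="\n") as f:
#     results = process_lines(f)
```

This version keeps a single core busy and finishes noticeably faster than the multiprocessing setup above.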