I have a few compressed files, each around 2 GB compressed. At the start of each file there is a set of headers, which I parse to extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]), for 0 <= i < len(pointers) - 1, I seek to pointers[i], read pointers[i+1] - pointers[i] bytes, and decompress that chunk. The problem is that a single Python process only gets through about 30 pointer pairs per second, which means each file takes more than a day to process.
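Rough arithmetic behind that estimate (assuming ~4,000,000 pairs at ~30 pairs per second):

pairs = 4000000  # pointer pairs per file
rate = 30        # pairs handled per second by one process
print(pairs / float(rate) / 3600)  # ~37 hours, i.e. more than a day per file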
Assuming that splitting the pointer list across multiple processes does not hurt performance (each process reads the same file, just different, non-overlapping parts of it), how can I use multiprocessing to speed this up?
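For concreteness, here is a toy illustration of the split I have in mind (the chunked() helper below does the same thing): adjacent chunks share a single boundary pointer so no pair is dropped, and each chunk maps to a contiguous, non-overlapping byte range of the file.

pointers = [0, 10, 25, 40, 70, 100]  # toy pointer list
chunksize = 2
chunks = [list(pointers[i:i + chunksize + 1])
          for i in range(0, len(pointers), chunksize)]
# chunks == [[0, 10, 25], [25, 40, 70], [70, 100]]
# each chunk covers bytes [chunk[0], chunk[-1]) of the compressed file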
My single-threaded operation looks like this:
from StringIO import StringIO  # Python 2

def search_clusters(pointers, filepath, automaton, counter):

    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        fh.seek(first_pointer)
        # pull the whole byte range covered by this pointer list into memory
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)

        # skipping details, ultimately the counter dict is
        # modified passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)

    return counter


# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

search_clusters(pointers, infile, automaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip

import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)


def chunked(pointers, chunksize=1024):
    # adjacent chunks share one boundary pointer so no pair is lost
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i + chunksize + 1])


def search_wrapper(args):
    return search_clusters(*args)


# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
But after it has been processing for a while, I get the following messages and then the script just hangs with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run the same thing with a serial map and no multiprocessing, it runs just fine:
map(search_wrapper, map_args)
Any suggestions on how to change my multiprocessing code so that it doesn't hang? Is it even a good idea to try to have multiple processes read from the same file?