Optimizing parallel analysis of GB-sized files

Time: 2016-10-28 00:15:07

Tags: python performance optimization multiprocessing text-processing

I have several compressed files, each around 2GB compressed. At the start of each file is a set of headers, which I parse to extract a list of ~4,000,000 pointers.

For each pair of pointers (pointers[i], pointers[i+1]) with 0 <= i < len(pointers) - 1, I:

  • seek to pointers[i]
  • read pointers[i+1] - pointers[i] bytes
  • decompress that chunk
  • perform a single pass over the decompressed data and update a dictionary with what I find (see the sketch right after this list)
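
For concreteness, here is a minimal sketch of that per-pair loop. count_chunks and process_chunk are hypothetical names, and it assumes each region is a self-contained LZMA stream that the standard lzma module can decompress; my actual code below buffers the whole region in memory and goes through a custom _decompress_lzma helper instead.

import lzma  # Python 3 standard library; backports.lzma provides it on Python 2

def count_chunks(filepath, pointers, process_chunk):
    # Sequentially seek/read/decompress each (pointers[i], pointers[i+1]) region
    # and let process_chunk do the single-pass dictionary update.
    counts = {}
    with open(filepath, 'rb') as fh:
        for start, end in zip(pointers, pointers[1:]):
            fh.seek(start)               # seek to pointers[i]
            raw = fh.read(end - start)   # read pointers[i+1] - pointers[i] bytes
            data = lzma.decompress(raw)  # decompress the chunk
            process_chunk(data, counts)  # single pass over the data, update the dict
    return counts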

The problem is that with a single Python process I can only get through about 30 pointer pairs per second, which at ~4,000,000 pairs per file (4,000,000 / 30 ≈ 133,000 seconds ≈ 37 hours) means each file takes well over a day to finish.

Assuming that splitting the pointer list across several processes doesn't hurt performance (each process reads the same file, just different, non-overlapping sections of it), how can I use multiprocessing to speed this up?

My single-threaded version looks like this:

from cStringIO import StringIO  # Python 2; use io.BytesIO on Python 3

def search_clusters(pointers, filepath, automaton, counter):
    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        # read the whole region covered by this chunk of pointers into memory once
        fh.seek(first_pointer)
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        # offsets into the in-memory buffer are relative to first_pointer
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)

        # skipping details, ultimately the counter dict is
        # modified passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)

    return counter


# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict() # returns collections.Counter()
automaton = load_automaton()

search_clusters(pointers, infile, automaton, counter)
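
update_counter_with_buffer is not shown here; purely as an illustration, assuming the automaton comes from the pyahocorasick package and counter is a collections.Counter, it could look roughly like this (a hypothetical sketch, not the actual implementation):

def update_counter_with_buffer(buffer, automaton, counter):
    # Hypothetical sketch: count every keyword match the Aho-Corasick
    # automaton reports while scanning the decompressed buffer.
    for _end_index, word in automaton.iter(buffer):
        counter[word] += 1
    return counter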

I tried changing this to use multiprocessing.Pool:

from itertools import repeat, izip
import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

def chunked(pointers, chunksize=1024):
    # overlap adjacent chunks by one pointer so the pair that spans
    # each chunk boundary is not lost
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i+chunksize+1])

def search_wrapper(args):
    return search_clusters(*args)

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers

counter = load_counter_dict() # returns collections.Counter()
automaton = load_automaton()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)

results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
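
Once pool.map returns, the per-chunk counters in results still have to be merged into a single total; a minimal sketch, assuming search_clusters returns one collections.Counter per chunk:

from collections import Counter

# merge the per-chunk counters returned by the worker processes
total = Counter()
for partial in results:
    total.update(partial)

(Since each worker starts from counter.copy(), any counts already present in the initial counter would be added once per chunk during the merge; starting the workers from an empty Counter avoids that double counting.)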

But after it has been processing for a while, I get the following messages and then the script just hangs with no further output:

[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()

However, if I run it with the serial version of map, without multiprocessing, it runs just fine:

map(search_wrapper, map_args)

Any suggestions on how to change my multiprocessing code so that it doesn't hang? Is it even a good idea to try to read the same file from multiple processes?

0 Answers