Question

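My function processes a file line by line as shown below. Several error patterns are defined, and each of them has to be applied to every line. I use multiprocessing.Pool to process the file in chunks.

For a 1 GB file, memory usage grows to 2 GB, and it stays at 2 GB after the file has been processed. The file is closed at the end.

If I comment out the call to re_pat.match, memory usage is normal and stays below 100 MB.

Am I using re the wrong way? I can't find a way to fix the memory leak.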

import itertools
import multiprocessing
import re


def line_match(lines, errors):
    for error in errors:
        try:
            re_pat = re.compile(error['pattern'])
        except Exception:
            print_error  # placeholder: error reporting omitted
            continue

        for line in lines:
            m = re_pat.match(line)
            # other code to handle matched object


def process_large_file(fo):
    p = multiprocessing.Pool()
    while True:
        # read the next line_per_proc lines from the file
        lines = list(itertools.islice(fo, line_per_proc))
        if not lines:
            break
        # arguments follow the signature line_match(lines, errors)
        result = p.apply_async(line_match, args=(lines, errors))

Note: I have omitted some code, because I think the important difference is whether re_pat.match(...) is called or not.

Answer 1

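A few things... Take a look at how itertools.islice is used here. It accepts positional arguments only, and when it is given just a stop value, as in islice(fo, line_per_proc), each call continues from wherever the file iterator currently is, so every task receives a different block of lines. What you should avoid is keeping more lines in memory than you need at one time, because that is wasteful. A chunk does have to be turned into a list, since apply_async pickles its arguments and an islice object cannot be pickled, so keep line_per_proc modest.

You need to do something like this:

line_per_proc = ...

while True:
    # islice(fo, n) consumes the next n lines; an empty list means end of file
    lines = list(itertools.islice(fo, line_per_proc))
    if not lines:
        break
    result = p.apply_async(line_match, args=(lines, errors))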
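
To make the chunking behaviour concrete, here is a minimal sketch of reading a file object in fixed-size blocks with itertools.islice. The helper name iter_chunks and the in-memory sample file are made up for illustration and are not part of the question's code.

```python
import io
import itertools


def iter_chunks(fo, lines_per_chunk):
    """Yield successive lists of at most lines_per_chunk lines from fo."""
    while True:
        # islice(fo, n) consumes the next n lines from the file iterator,
        # so consecutive calls never return the same line twice.
        chunk = list(itertools.islice(fo, lines_per_chunk))
        if not chunk:
            break
        yield chunk


# Hypothetical in-memory "file" with five lines, used only for illustration.
sample = io.StringIO("a\nb\nc\nd\ne\n")
chunks = list(iter_chunks(sample, 2))
```

Because iter_chunks is a generator, only one chunk exists at a time unless the caller decides to keep them, as the list(...) call does here for demonstration.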

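Next, for each pattern you iterate over lines. So if you have 10 patterns, you iterate over the same lines 10 times! For a large file this is particularly inefficient, especially when the lines are loaded as a list.

Try this: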

def line_match(lines, errors):
    # first, compile all your patterns
    compiled_patterns = []
    for error in errors:
        compiled_patterns.append(re.compile(error['pattern']))

    for line in lines:  # a single pass over the chunk
        for pattern in compiled_patterns:
            m = pattern.match(line)
            # other code to handle matched object
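
As a runnable illustration of the single-pass idea, the sketch below fills in the "handle matched object" step with something simple: it records the line number, the pattern text, and the matched text. That handling, the sample patterns, and the sample lines are assumptions made for the example, not the asker's actual logic.

```python
import re


def line_match(lines, errors):
    """Return (line_number, pattern, matched_text) for every match."""
    compiled = []
    for error in errors:
        try:
            compiled.append(re.compile(error["pattern"]))
        except re.error:
            # Skip patterns that do not compile, as the question's code does.
            continue

    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern in compiled:
            m = pattern.match(line)
            if m:
                # Keep only small plain values, not the match object itself.
                hits.append((number, pattern.pattern, m.group(0)))
    return hits


# Hypothetical sample data; the last pattern is deliberately invalid.
errors = [{"pattern": r"ERROR \d+"}, {"pattern": r"WARN"}, {"pattern": r"("}]
lines = ["ERROR 42 disk full\n", "ok\n", "WARN low memory\n"]
hits = line_match(lines, errors)
```

Returning plain tuples rather than match objects is a design choice for this sketch: a match object keeps a reference to the line it was matched against, so storing many of them keeps those lines alive as well.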

This should help you get started on a complete solution.
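
One way the pieces might fit together is sketched below. It rests on an assumption that has not been verified against the asker's setup: apply_async returns immediately, so a fast reading loop can queue most of the file before the workers catch up, and slower regex matching would make that backlog larger. The sketch caps the number of outstanding tasks. The names count_matches and max_pending are invented for the example, and a ThreadPool with an in-memory file is used only so the snippet runs anywhere.

```python
import collections
import io
import itertools
import re
from multiprocessing.pool import ThreadPool


def count_matches(lines, patterns):
    """Worker: count the lines matched by at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return sum(1 for line in lines if any(c.match(line) for c in compiled))


def process_large_file(fo, pool, patterns, lines_per_chunk=1000, max_pending=4):
    """Submit chunks to pool with at most max_pending tasks outstanding."""
    pending = collections.deque()
    total = 0
    while True:
        chunk = list(itertools.islice(fo, lines_per_chunk))
        if not chunk:
            break
        pending.append(pool.apply_async(count_matches, (chunk, patterns)))
        if len(pending) >= max_pending:
            # Wait for the oldest task before reading more of the file,
            # so unread lines are not queued up in memory.
            total += pending.popleft().get()
    while pending:
        total += pending.popleft().get()
    return total


# Illustration only: 300 lines, 200 of which match one of the patterns.
sample = io.StringIO("ERROR 1\nok\nWARN x\n" * 100)
with ThreadPool(2) as pool:
    total = process_large_file(sample, pool, [r"ERROR \d+", r"WARN"],
                               lines_per_chunk=7)
```

With multiprocessing.Pool in place of ThreadPool, the same function should work as long as count_matches is defined at module level so it can be pickled, and the pool is created under an if __name__ == "__main__": guard on platforms that spawn new processes.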
