Question

Problem:

Replacing multiple string patterns in a large text file takes a lot of time (Python).

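Context:

I have a large text file with no particular structure. It does, however, contain several kinds of patterns, for example email addresses and phone numbers.

There are more than 100 different patterns of this kind, and the file is 10 MB (and may grow). The file may or may not contain all 100 patterns.

At the moment I replace the matches with re.sub(), and the replacement is done as shown below.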

import gzip
import re

readfile = gzip.open(path, 'rt') # read the zipped file in text mode
lines = readfile.readlines() # load the lines
readfile.close()

linestr = '' # accumulator for the non-empty lines
for line in lines:
    if len(line.strip()) != 0: # skip the empty lines
        linestr += line

for pattern in patterns: # patterns contains all regexes and their replacements
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex) # my own helper, not shown here
    linestr = re.sub(compiled_regex, replace, linestr)

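This approach takes a long time on large files. Is there a better way to optimize it?

I am thinking of replacing += with .join(), but I am not sure how much that will help.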

Answer 1

You can use line_profiler to find out which lines of your code take the most time:

pip install line_profiler    
kernprof -l run.py
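
Note that kernprof only reports on functions decorated with @profile, a name it injects into builtins when the script runs under it. Here is a minimal sketch of what run.py could look like. The function name replace_all and the two patterns are made up for illustration, and the no-op fallback is only there so the script also runs without kernprof:

```python
import builtins
import re

# kernprof -l injects `profile` into builtins; fall back to a no-op otherwise.
profile = getattr(builtins, "profile", lambda func: func)

# Hypothetical patterns, for illustration only (not the asker's real ones).
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{4}\b"), "<PHONE>"),
]


@profile
def replace_all(text):
    """Apply every (compiled_regex, replacement) pair to the text in turn."""
    for regex, replacement in PATTERNS:
        text = regex.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(replace_all("mail bob@example.com or call 555-1234"))
```

Running kernprof -l -v run.py prints the per-line timings right away; without -v the results are written to run.py.lprof, which you can view with python -m line_profiler run.py.lprof.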

One more thing: I think the string you are building in memory is too large. Maybe you could use generators.
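
A rough sketch of the generator idea, assuming that none of the patterns needs to match across a line break and that the file is UTF-8 text (both are assumptions, the question does not say). The function names and the output path argument are hypothetical:

```python
import gzip
import re


def non_empty_lines(path):
    """Yield non-blank lines one at a time instead of loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        for line in fp:
            if line.strip():
                yield line


def replace_lines(lines, compiled_patterns):
    """Lazily apply every (compiled_regex, replacement) pair to each line."""
    for line in lines:
        for regex, replacement in compiled_patterns:
            line = regex.sub(replacement, line)
        yield line


def process(src_path, dst_path, compiled_patterns):
    """Stream the source file through the replacements into a new gzip file."""
    with gzip.open(dst_path, "wt", encoding="utf-8") as out:
        out.writelines(replace_lines(non_empty_lines(src_path), compiled_patterns))
```

This way only one line is held in memory at a time, and each regex runs over a short string rather than one 10 MB string. Whether it is actually faster depends on the patterns, so it is worth profiling both versions.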

Answer 2

You can get slightly better results with:

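import gzip
import re

large_list = []

with gzip.open(path, 'rt') as fp:  # text mode, so lines are str rather than bytes
    for line in fp:  # iterate over the file directly instead of readlines()
        if line.strip():  # keep only non-empty lines
            large_list.append(line)

merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)  # helper from the question (not shown)
    merged_lines = re.sub(compiled_regex, replace, merged_lines)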

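However, further optimization is possible if you know what kind of processing you are applying. In practice, the last line is the one that will eat all the CPU time (and memory allocation). If the regexes can be applied on a per-line basis, you can get great results with the multiprocessing package. Threads won't gain you anything for this CPU-bound work because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).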
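
A minimal sketch of the multiprocessing idea, assuming the patterns really can be applied line by line. The patterns, function names, worker count, and chunk size are all placeholders for illustration, not tuned values:

```python
import re
from multiprocessing import Pool

# Hypothetical patterns; defined at module level so worker processes can use them.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{4}\b"), "<PHONE>"),
]


def replace_in_chunk(lines):
    """Apply every pattern to each line of one chunk."""
    result = []
    for line in lines:
        for regex, replacement in PATTERNS:
            line = regex.sub(replacement, line)
        result.append(line)
    return result


def chunked(lines, size):
    """Split a list of lines into consecutive chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_replace(lines, workers=4, chunk_size=10_000):
    """Process chunks in separate processes; Pool.map keeps the chunk order."""
    with Pool(workers) as pool:
        processed = pool.map(replace_in_chunk, chunked(lines, chunk_size))
    return [line for chunk in processed for line in chunk]
```

Call parallel_replace(large_list) from under an if __name__ == "__main__": guard, since worker processes may re-import the main module. Sending chunks rather than single lines keeps the pickling overhead low; for small files the process start-up cost can outweigh the gain.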
