Question

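Iterators and generators are now the standard for memory-efficient code, and I try to use them whenever I need to process long lists. Is it possible to use a multiline regular expression while iterating over a large file (> 500 MB) with an iterator?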

The classic way:

import re
my_regex = re.compile(r'some text', re.MULTILINE)

with open('my_large_file.txt', 'r') as f:
    text = f.read() # Reads the whole file into a single string
                    # This is memory consuming
result = my_regex.findall(text)

The iterator way:

import re
my_regex = re.compile(r'some text', re.MULTILINE)

with open('my_large_file.txt', 'r') as f:
    for line in f: # Use the file as an iterator and
                   # loop over the lines
        pass       # What could I do here?

Minimal working example:

The large file:

Lorem ipsum dolor sit amet, 
consectetur adipiscing elit, 
sed do eiusmod tempor. 
--------------------------------
Some text I want to capture
--------------------------------
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, 
sed do eiusmod tempor.

My regular expression:

my_regex = re.compile(r"[-]+$\n(.+)\n\s*[-]+", re.MULTILINE)
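
For reference, here is a quick in-memory check of a pattern of this shape against the sample above. It is only a sketch: the names `SAMPLE` and `check_regex` are made up for illustration, and the pattern uses `\s*` before the closing dashes so that a closing rule with no leading whitespace, as in the sample, still matches.

```python
import re

# Sample text from the question, held in memory for a quick check.
SAMPLE = (
    "Lorem ipsum dolor sit amet, \n"
    "consectetur adipiscing elit, \n"
    "sed do eiusmod tempor. \n"
    "--------------------------------\n"
    "Some text I want to capture\n"
    "--------------------------------\n"
    "Lorem ipsum dolor sit amet,\n"
    "consectetur adipiscing elit, \n"
    "sed do eiusmod tempor.\n"
)

# Assumption: \s* (zero or more whitespace characters) before the closing dashes.
check_regex = re.compile(r"[-]+$\n(.+)\n\s*[-]+", re.MULTILINE)

# findall returns the content of the single capture group for each match.
captured = check_regex.findall(SAMPLE)
print(captured)
```

On this sample the only captured group is the line between the two dashed rules.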

Answer 1

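What you can do is iterate over the lines of the file, append each one to a running text buffer, and test the regular expression against that buffer. Once a match is found, you can empty the buffer. Note that the buffer is only emptied on a match, so memory use is bounded by the distance between matches rather than by the size of the file.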

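text = ''     # running buffer of the lines read since the last match
results = []
# my_regex is the compiled pattern from the question
with open('my_large_file.txt', 'r') as f:
    for line in f:
        text += line
        result = my_regex.findall(text)
        if result:
            results += result
            text = ''  # discard the buffer once a match has been found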
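
If matches are rare, the running buffer can still grow large. A possible variation, sketched here under the assumption that a match never spans more than a known number of lines (three for the sample in the question), is to keep only a sliding window of the most recent lines. The function name `iter_multiline_matches`, the `max_lines` parameter, and the `\s*` form of the pattern are illustrative choices, not part of the original answer; `io.StringIO` stands in for the real file.

```python
import io
import re
from collections import deque


def iter_multiline_matches(lines, regex, max_lines):
    """Yield matches from an iterable of lines, holding at most max_lines lines in memory."""
    window = deque(maxlen=max_lines)  # the oldest line is dropped automatically
    for line in lines:
        window.append(line)
        match = regex.search("".join(window))
        if match:
            yield match
            window.clear()  # avoid reporting the same match twice


# Assumption: a pattern of the same shape as in the question, with \s* before the closing dashes.
pattern = re.compile(r"[-]+$\n(.+)\n\s*[-]+", re.MULTILINE)

# Stand-in for open('my_large_file.txt'); any iterable of lines works.
sample = io.StringIO(
    "Lorem ipsum dolor sit amet, \n"
    "consectetur adipiscing elit, \n"
    "sed do eiusmod tempor. \n"
    "--------------------------------\n"
    "Some text I want to capture\n"
    "--------------------------------\n"
    "Lorem ipsum dolor sit amet,\n"
    "consectetur adipiscing elit, \n"
    "sed do eiusmod tempor.\n"
)

captured = [m.group(1) for m in iter_multiline_matches(sample, pattern, max_lines=3)]
print(captured)
```

With a real file, the call would look like `iter_multiline_matches(f, pattern, max_lines=3)` inside the `with open(...)` block. The trade-off is that a match spanning more than `max_lines` lines is silently missed, so the window size has to be chosen to fit the pattern.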

Are multiline regular expressions compatible with iterators?
