Are multiline regular expressions compatible with iterators?

Time: 2019-05-09 21:00:24

Tags: python regex python-3.x string iterator

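Iterators and generators are now the standard for memory-efficient code, and I try to use them whenever I have to process long lists. Is it possible to use a multiline regular expression while iterating over a large file (> 500 MB) with an iterator?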

Classic approach:

import re
my_regex = re.compile(r'some text', re.MULTILINE)

with open('my_large_file.txt', 'r') as f:
    text = f.read()  # Reads the whole file into a single string,
                     # which is memory consuming
result = my_regex.findall(text)

Iterator approach:

import re
my_regex = re.compile(r'some text', re.MULTILINE)

with open('my_large_file.txt', 'r') as f:
    for line in f:  # Use the file as an iterator and
                    # loop over the lines
        pass        # What could I do here?

Minimal working example:

Large file:

Lorem ipsum dolor sit amet, 
consectetur adipiscing elit, 
sed do eiusmod tempor. 
--------------------------------
Some text I want to capture
--------------------------------
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, 
sed do eiusmod tempor.

My regular expression:

my_regex = re.compile(r"[-]+$\n(.+)\n\s[-]+", re.MULTILINE)   
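To illustrate why line-by-line iteration is the sticking point, here is a small check on a shortened copy of the sample. The names `sample` and `pattern` are hypothetical. The pattern uses `\s*` where the expression above has `\s`, on the assumption that the closing dashes may start at column 0: as the sample is shown here, there is no whitespace character before them for `\s` to consume.

```python
import re

# Assumption: a shortened copy of the sample file, held in memory for the demo.
sample = (
    "Lorem ipsum dolor sit amet,\n"
    "--------------------------------\n"
    "Some text I want to capture\n"
    "--------------------------------\n"
    "sed do eiusmod tempor.\n"
)

# Assumption: \s* rather than \s, so the closing dashes may start at column 0.
pattern = re.compile(r"[-]+$\n(.+)\n\s*[-]+", re.MULTILINE)

# Whole text at once: the three-line pattern can match.
whole_text_matches = pattern.findall(sample)

# One line at a time: no single line contains all three parts of the pattern.
per_line_matches = [
    m
    for line in sample.splitlines(keepends=True)
    for m in pattern.findall(line)
]
```

Applied to the whole string, the pattern finds the captured line; applied to each line separately, it finds nothing, because the match spans three consecutive lines.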

1 Answer:

Answer 0: (score: 2)

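What you can do is iterate over the file's lines, append each one to a running text, and test that text with the regex. Once a match is found, you can empty the running text.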

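text = ''
results = []
with open('my_large_file.txt', 'r') as f:
    for line in f:
        text += line                      # grow the running text
        result = my_regex.findall(text)   # my_regex is the pattern from the question
        if result:
            results += result
            text = ''                     # reset once a match has been captured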
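One caveat with the loop above is that the running text keeps growing, and is rescanned on every line, for as long as no match turns up. A possible variation, sketched below, bounds the buffer with a fixed-size window of recent lines. The function name `iter_multiline_matches`, the `max_lines` parameter, and the demo names are hypothetical, and the sketch assumes that no single match spans more than `max_lines` lines. The demo pattern uses `\s*` where the question has `\s`, on the assumption that the closing dashes may start at column 0.

```python
import io
import re
from collections import deque


def iter_multiline_matches(lines, pattern, max_lines=3):
    """Yield findall-style results while holding at most max_lines lines.

    Assumption: no single match spans more than max_lines lines.
    """
    window = deque(maxlen=max_lines)  # the oldest line is dropped automatically
    for line in lines:
        window.append(line)
        found = pattern.findall("".join(window))
        if found:
            yield from found
            window.clear()  # same reset as `text = ''` above


# Assumption: \s* instead of the question's \s, so that the closing dashes
# may start at column 0 as they do in the sample.
demo_regex = re.compile(r"[-]+$\n(.+)\n\s*[-]+", re.MULTILINE)

# io.StringIO stands in for the large file; an open file object is iterated
# line by line in the same way.
demo_file = io.StringIO(
    "Lorem ipsum dolor sit amet,\n"
    "--------------------------------\n"
    "Some text I want to capture\n"
    "--------------------------------\n"
    "sed do eiusmod tempor.\n"
)
captured = list(iter_multiline_matches(demo_file, demo_regex))
```

With a real file, the call would look like `for hit in iter_multiline_matches(f, my_regex): ...` inside the `with open(...)` block. Memory use then depends on `max_lines` and the line lengths, not on the file size. A pattern with open-ended quantifiers could still match early on a partial window, so this fits best when each match has a clear closing delimiter, as with the dashed lines here.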