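Iterators and generators are now the standard for memory-efficient code, and I try to use them whenever I have to process long lists. Is it possible to apply a multiline regular expression while iterating over a large file (> 500 MB) with an iterator?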
The classic way:
import re
my_regex = re.compile(r'some text', re.MULTILINE)
with open('my_large_file.txt', 'r') as f:
    text = f.read()  # Reads the whole file into a single string,
                     # which is memory consuming
result = my_regex.findall(text)
The iterator way:
import re
my_regex = re.compile(r'some text', re.MULTILINE)
with open('my_large_file.txt', 'r') as f:
    for line in f:  # Use the file as an iterator and loop over the lines
        pass  # What could I do?
Minimal working example:
Large file:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor.
--------------------------------
Some text I want to capture
--------------------------------
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor.
My regex:
my_regex = re.compile(r"[-]+$\n(.+)\n\s[-]+", re.MULTILINE)
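Answer 0 (score: 2)
What you can do is loop over the lines of the file and concatenate them into a running text, testing the regex against that text after each line. Once a match is found, you can empty the running text and continue.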
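text = ''  # running buffer of the lines read since the last match
results = []
with open('my_large_file.txt', 'r') as f:
    for line in f:
        text += line
        # my_regex is the compiled pattern from the question
        result = my_regex.findall(text)
        if result:
            results += result
            text = ''  # start over once a match has been captured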
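
The running-buffer idea above can also be written as a generator, which fits the iterator style the question asks about. The following is a minimal sketch, not the answerer's code: `iter_matches`, `BLOCK_RE`, `SAMPLE` and the `window` parameter are hypothetical names, and the pattern is a simplified variant of the question's regex (without the `\s`) chosen so that it matches the sample exactly as shown. It assumes a match never spans more than `window` lines, so the buffer stays bounded even when long stretches of the file contain no match.

```python
import re
from collections import deque
from typing import Iterable, Iterator

# Assumption: simplified variant of the question's pattern (no whitespace
# class before the closing dashes), so it matches the sample as shown.
BLOCK_RE = re.compile(r"^-+\n(.+)\n-+$", re.MULTILINE)


def iter_matches(lines: Iterable[str],
                 pattern: re.Pattern = BLOCK_RE,
                 window: int = 3) -> Iterator[str]:
    """Yield group 1 of each match found in a sliding window of lines."""
    # A deque with maxlen drops the oldest line automatically, so at most
    # `window` lines are held in memory at any time.
    buffer = deque(maxlen=window)
    for line in lines:
        buffer.append(line)
        match = pattern.search("".join(buffer))
        if match:
            yield match.group(1)
            # Empty the buffer so the closing dashes are not reused
            # as the opening line of another match.
            buffer.clear()


# Hypothetical stand-in for the large file from the question.
SAMPLE = """\
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor.
--------------------------------
Some text I want to capture
--------------------------------
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor.
"""

if __name__ == "__main__":
    for captured in iter_matches(SAMPLE.splitlines(keepends=True)):
        print(captured)
```

Passing an open file object in place of the sample lines should work the same way, because a text file iterates over its lines. A match that spans more lines than `window` would be missed, so the window has to be sized for the longest block you expect.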