I have looked at several similar questions on SO (copying trigger lines or blocks of a fixed size), but they don't quite fit what I am trying to do. I have a very large text file (the output of Valgrind) and I only want to cut it down to the parts I need.
The file is structured in blocks of lines, each beginning with a header line that contains the string 'in loss record'. I want to trigger only on those header lines that also contain the string 'definitely lost', then copy all the lines that follow until another header line is reached (at which point the decision process repeats).
How can I implement such a select-and-copy script in Python?
Here is what I have tried so far. It works, but I don't think it is the most efficient (or pythonic) way to do it, so I would like to see faster approaches, since the files I am dealing with are usually very large. (This method takes 1.8s on a 290M file)
with open("in_file.txt","r") as fin:
with open("out_file.txt","w") as fout:
lines = fin.read().split("\n")
i=0
while i<len(lines):
if "blocks are definitely lost in loss record" in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
while i<len(lines) and "loss record" not in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
i+=1
Answer 0 (score: 2)
You could try a regex together with mmap. Something like:
import re, mmap

# create a regex that will define each block of text you want here
# (a bytes pattern, since we will be searching an mmap):
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)

with open(fn, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # m is a match for one block that you want
        print(m.group(1))
Since you gave no example of your input, that regex certainly won't work as written - but you get the idea. With mmap, the whole file is treated as one long string without necessarily all being in memory, so large files can be searched and blocks selected out of them this way.
If your file fits comfortably in memory, you can just read it in and apply the regex directly (pseudo Python):
with open(fn) as fo:
    pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
    for i, block in enumerate(pat.finditer(fo.read())):
        pass  # deal with each block here
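As a sketch of what such a pattern could look like for this task (assuming - since the question doesn't show the raw input - the usual Valgrind headers of the form '... blocks are definitely lost in loss record N of M'), you can keep the header match on one line and use a lookahead to stop at the next header:

import re

# hypothetical pattern; assumes every header line contains 'loss record'
# and that the wanted headers also contain 'definitely lost'
pat = re.compile(
    r'^(.*?blocks are definitely lost in loss record.*\n'  # wanted header line
    r'(?:(?!.*loss record).*\n?)*)',                       # body lines up to the next header
    re.M)

with open(fn) as fo:  # fn: input path, as above
    for m in pat.finditer(fo.read()):
        print(m.group(1))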
If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n delimited text file):
with open(fn) as fo:
    for line in fo:
        # deal with each line here
        # DON'T do something like string = fo.read() and
        # then iterate over the lines of the string please...
        # unless you need random access to the lines out of order
        pass
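Applied to the question's task, that line-by-line loop might be filled in like this (a minimal sketch; the file names are just the placeholders from the question):

with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    copying = False
    for line in fin:
        if "loss record" in line:
            # each header line re-decides whether the block below it is copied
            copying = "definitely lost" in line
        if copying:
            fout.write(line)

Because the file object is iterated directly, only one line is held in memory at a time.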
Answer 1 (score: 0)
Another approach is to use groupby to identify header lines and to select a function that will either write out or ignore the lines that follow. You can then iterate over the file line by line and keep the memory footprint down.
import itertools

def megs(val):
    return val * (2**20)

def ignorelines(lines):
    for line in lines:
        pass

# assuming ascii or utf-8 you save a small amount of processing by avoiding decode/encode
# and a few fewer trips to the disk with larger buffers
with open('test.log', 'rb', buffering=megs(4)) as infile,\
     open('out.log', 'wb', buffering=megs(4)) as outfile:
    dump_fctn = ignorelines  # ignore lines til we see a good header
    # group by header or contained lines
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x):
        if is_hdr:
            for hdr in block:
                if b'definitely lost' in hdr:
                    outfile.write(hdr)
                    dump_fctn = outfile.writelines
                else:
                    dump_fctn = ignorelines
        else:
            # either writelines or ignorelines, depending on last header seen
            dump_fctn(block)

print(open('out.log').read())
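To see why the grouping works (a toy illustration, not part of the original answer): groupby starts a new group every time the key function's value changes, so each header line and each run of body lines arrives as its own group:

import itertools

lines = [b'header A in loss record 1\n', b'body 1\n', b'body 2\n',
         b'header B in loss record 2\n', b'body 3\n']
for is_hdr, block in itertools.groupby(lines, lambda x: b'in loss record' in x):
    print(is_hdr, list(block))
# True [b'header A in loss record 1\n']
# False [b'body 1\n', b'body 2\n']
# True [b'header B in loss record 2\n']
# False [b'body 3\n']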