Reading and copying specific blocks of text in Python

Date: 2017-04-15 20:38:50

Tags: python regex file-io

I've seen a few similar questions on SO (copying trigger lines or blocks of a fixed size), but they don't quite fit what I'm trying to do. I have a very large text file (output from Valgrind) and I want to cut it down to only the parts I need.

The file is structured like this: it consists of blocks of lines, each starting with a header line containing the string 'in loss record'. I want to trigger only on header lines that also contain the string 'definitely lost', then copy all the lines below until the next header line is reached (at which point the decision process repeats).
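
For reference, a block I want to keep looks roughly like the following (illustrative only; the actual process IDs, sizes, and addresses vary):

==12345== 16 bytes in 1 blocks are definitely lost in loss record 3 of 10
==12345==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12345==    by 0x4005F6: main (example.c:7)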

How can I implement a selection-and-copy script like this in Python?

Here is what I have tried so far. It works, but I don't think it is the most efficient (or Pythonic) way to do it, so I would like to see faster approaches, since the files I work with are usually very large. (For a 290 MB file this method takes 1.8 s.)

with open("in_file.txt", "r") as fin, open("out_file.txt", "w") as fout:
    lines = fin.read().split("\n")
    i = 0
    while i < len(lines):
        if "blocks are definitely lost in loss record" in lines[i]:
            fout.write(lines[i].rstrip() + "\n")
            i += 1
            # copy the block body until the next header line
            while i < len(lines) and "loss record" not in lines[i]:
                fout.write(lines[i].rstrip() + "\n")
                i += 1
            # don't advance here: the header that stopped the inner
            # loop still needs to be tested on the next iteration
        else:
            i += 1

2 Answers:

Answer 0 (score: 2)

You could try a regular expression together with mmap.

Something like this:

import re, mmap

# create a regex that will define each block of text you want here
# (a bytes pattern, since mmap exposes the file as bytes):
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn, 'r+b') as f:  # fn = path to your log file
    mm = mmap.mmap(f.fileno(), 0)
    for m in pat.finditer(mm):
        # m is a block that you want
        print(m.group(1))

Since you haven't given an example of your input, that regex certainly won't work as written, but you get the idea.

With mmap the whole file is treated as a single string without necessarily all being in memory, so large files can be searched and blocks selected out of them this way.
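
A minimal sketch of that idea, assuming a file named in_file.txt: the mapped object can be sliced and searched much like a bytes object, without reading the whole file in.

import mmap

with open("in_file.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm.find(b"definitely lost"))  # byte offset of the first match, or -1
    print(mm[:60])                      # slicing only touches the pages it needs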

If your file fits in memory, you can read the file in directly and run the regex over it:

pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn) as fo:
    for m in pat.finditer(fo.read()):
        block = m.group(1)
        # deal with each block
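
For the OP's use case, the loop body could simply write each matched block back out, e.g. (a sketch reusing the same hypothetical pat as above):

with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    for m in pat.finditer(fin.read()):
        fout.write(m.group(1) + "\n")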

If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):

with open(fn) as fo:
    for line in fo:
        # deal with each line here

        # DON'T do something like string = fo.read() and
        # then iterate over the lines of the string please...
        # unless you need random access to the lines out of order
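
Applied to the OP's task, a line-by-line version might look something like this (a sketch using the trigger strings from the question; adjust to your actual log format):

copying = False
with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    for line in fin:
        if "in loss record" in line:
            # header line: decide whether the block that follows is kept
            copying = "definitely lost" in line
        if copying:
            fout.write(line)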

Answer 1 (score: 0)

Another approach is to use groupby to identify the header lines and to set a function that will either write out or ignore the lines that follow. You can then iterate over the file line by line and keep the memory footprint small.
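
To see what groupby contributes here, a toy example with made-up lines (hypothetical data, not real Valgrind output): consecutive lines with the same key are batched into one group.

import itertools

lines = [b"16 bytes in loss record 1", b"   at main", b"32 bytes in loss record 2"]
for is_hdr, group in itertools.groupby(lines, lambda x: b"in loss record" in x):
    print(is_hdr, list(group))
# True  [b'16 bytes in loss record 1']
# False [b'   at main']
# True  [b'32 bytes in loss record 2']

The complete version for the file then looks like this: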

import itertools

def megs(val):
    return val * (2**20)

def ignorelines(lines):
    for line in lines:
        pass

# assuming ascii or utf-8 you save a small amount of processing by avoiding decode/encode
# and a few fewer trips to the disk with larger buffers
with open('test.log', 'rb', buffering=megs(4)) as infile, \
        open('out.log', 'wb', buffering=megs(4)) as outfile:
    dump_fctn = ignorelines # ignore lines til we see a good header
    # group by header or contained lines
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x):
        if is_hdr:
            for hdr in block:
                if b'definitely lost' in hdr:
                    outfile.write(hdr)
                    dump_fctn = outfile.writelines
                else:
                    dump_fctn = ignorelines
        else:
            # either writelines or ignorelines, depending on last header seen
            dump_fctn(block)

print(open('out.log').read())