Reading and copying specific blocks of text in Python

Date: 2017-04-15 20:38:50

Tags: python regex file-io

I've seen a few similar questions on SO (copying trigger lines or blocks of a fixed size), but they don't quite fit what I'm trying to do. I have a very large text file (output from Valgrind) and I want to cut it down to only the parts I need.

The file is structured like this: it consists of blocks of lines, each starting with a header line containing the string 'in loss record'. I want to trigger only on header lines that also contain the string 'definitely lost', then copy all the lines below until the next header line is reached (at which point the decision process repeats).
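
For reference, a block I want to keep looks roughly like the following (illustrative only; the actual process IDs, sizes, and addresses vary):

==12345== 16 bytes in 1 blocks are definitely lost in loss record 3 of 10
==12345==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12345==    by 0x4005F6: main (example.c:7)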

How can I implement a selection-and-copy script like this in Python?

Here is what I have tried so far. It works, but I don't think it is the most efficient (or Pythonic) way to do it, so I would like to see faster approaches, since the files I work with are usually very large. (For a 290 MB file this method takes 1.8 s.)

with open("in_file.txt", "r") as fin, open("out_file.txt", "w") as fout:
    lines = fin.read().split("\n")
    i = 0
    while i < len(lines):
        if "blocks are definitely lost in loss record" in lines[i]:
            fout.write(lines[i].rstrip() + "\n")
            i += 1
            # copy the block body until the next header line
            while i < len(lines) and "loss record" not in lines[i]:
                fout.write(lines[i].rstrip() + "\n")
                i += 1
            # don't advance here: the header that stopped the inner
            # loop still needs to be tested on the next iteration
        else:
            i += 1

2 Answers:

Answer 0 (score: 2)

You could try a regular expression together with mmap.

Something like this:

import re, mmap

# create a regex that will define each block of text you want here
# (a bytes pattern, since mmap exposes the file as bytes):
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn, 'r+b') as f:  # fn = path to your log file
    mm = mmap.mmap(f.fileno(), 0)
    for m in pat.finditer(mm):
        # m is a block that you want
        print(m.group(1))

Since you haven't given an example of your input, that regex certainly won't work as written, but you get the idea.

With mmap the whole file is treated as a single string without necessarily all being in memory, so large files can be searched and blocks selected out of them this way.
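
A minimal sketch of that idea, assuming a file named in_file.txt: the mapped object can be sliced and searched much like a bytes object, without reading the whole file in.

import mmap

with open("in_file.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm.find(b"definitely lost"))  # byte offset of the first match, or -1
    print(mm[:60])                      # slicing only touches the pages it needs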

If your file fits in memory, you can read the file in directly and run the regex over it:

pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn) as fo:
    for m in pat.finditer(fo.read()):
        block = m.group(1)
        # deal with each block
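
For the OP's use case, the loop body could simply write each matched block back out, e.g. (a sketch reusing the same hypothetical pat as above):

with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    for m in pat.finditer(fin.read()):
        fout.write(m.group(1) + "\n")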

If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):

with open(fn) as fo:
    for line in fo:
        # deal with each line here

        # DON'T do something like string = fo.read() and
        # then iterate over the lines of the string please...
        # unless you need random access to the lines out of order
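
Applied to the OP's task, a line-by-line version might look something like this (a sketch using the trigger strings from the question; adjust to your actual log format):

copying = False
with open("in_file.txt") as fin, open("out_file.txt", "w") as fout:
    for line in fin:
        if "in loss record" in line:
            # header line: decide whether the block that follows is kept
            copying = "definitely lost" in line
        if copying:
            fout.write(line)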

Answer 1 (score: 0)

Another approach is to use groupby to identify the header lines and to set a function that will either write out or ignore the lines that follow. You can then iterate over the file line by line and keep the memory footprint small.
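
To see what groupby contributes here, a toy example with made-up lines (hypothetical data, not real Valgrind output): consecutive lines with the same key are batched into one group.

import itertools

lines = [b"16 bytes in loss record 1", b"   at main", b"32 bytes in loss record 2"]
for is_hdr, group in itertools.groupby(lines, lambda x: b"in loss record" in x):
    print(is_hdr, list(group))
# True  [b'16 bytes in loss record 1']
# False [b'   at main']
# True  [b'32 bytes in loss record 2']

The complete version for the file then looks like this: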

import itertools

def megs(val):
    return val * (2**20)

def ignorelines(lines):
    for line in lines:
        pass

# assuming ascii or utf-8 you save a small amount of processing by avoiding decode/encode
# and a few fewer trips to the disk with larger buffers
with open('test.log', 'rb', buffering=megs(4)) as infile, \
        open('out.log', 'wb', buffering=megs(4)) as outfile:
    dump_fctn = ignorelines # ignore lines til we see a good header
    # group by header or contained lines
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x):
        if is_hdr:
            for hdr in block:
                if b'definitely lost' in hdr:
                    outfile.write(hdr)
                    dump_fctn = outfile.writelines
                else:
                    dump_fctn = ignorelines
        else:
            # either writelines or ignorelines, depending on last header seen
            dump_fctn(block)

print(open('out.log').read())