Read a file line by line, but in reverse (last line, then the second-to-last line, and so on)

Time: 2014-09-22 08:42:21

Tags: python

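I want to remove trailing blank lines (if any) from a file. At the moment I do this by reading the file into memory, removing the blank lines there, and overwriting the file. But the file is large (more than 30,000 lines, and the lines are long), so this takes 2-3 seconds.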

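So I want to read the file line by line, but backward, until I reach the first non-blank line. That is, I start with the last line, then the second-to-last, and so on. Then I would truncate the file rather than overwrite it.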

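What is the best way to read it in reverse? Right now I am thinking of reading a 64k block, looping over the string backward, character by character, until I have a line, and when I run out of the 64k, reading another 64k block and prepending it, and so on.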

I assume there is no standard function or library for reading in reverse order?
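
The block-wise backward scan described in the question could look roughly like the sketch below. It only locates the byte offset where the real content ends and leaves truncation to the caller. The name `find_content_end` is hypothetical, and the sketch assumes that "blank" means a line made up of nothing but CR/LF bytes; lines holding only spaces or tabs would count as content.

```python
import os

def find_content_end(path, blocksize=64 * 1024):
    """Return the byte offset just past the last non-newline byte.

    Hypothetical helper: only CR and LF bytes are treated as blank.
    """
    with open(path, "rb") as f:
        # Start at the end of the file and work toward the beginning.
        f.seek(0, os.SEEK_END)
        here = f.tell()
        while here > 0:
            delta = min(blocksize, here)
            f.seek(here - delta)
            block = f.read(delta)
            # Walk the block backward, byte by byte.
            for i in range(len(block) - 1, -1, -1):
                if block[i] not in b"\r\n":
                    return here - delta + i + 1
            here -= delta
    # The file is empty or contains only newline bytes.
    return 0
```

Because the file is opened in binary mode, the returned value is a plain byte offset that can be passed to `truncate()`, adding one or two bytes first if the final line terminator should be kept.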

2 answers:

Answer 0: (score: 2)

This is a modified version of some code I found elsewhere (probably on StackOverflow, actually...). I have extracted the two key methods that handle reading backward.

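The reversed_blocks iterator reads the file backward in blocks of whatever size you like. The reversed_lines iterator splits each block into lines and holds back the first one. If the next block ends with a newline, the held-back line is returned as a complete line; if it does not, the held-back partial line is appended to the last line of the new block, completing the line that was split across the block boundary.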

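All of the state is maintained by Python's iterator mechanism, so we don't have to store state globally anywhere. This also means you can read several files backward at the same time if you need to, because the state is tied to the iterator.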

import os  # reversed_blocks uses os.SEEK_END and os.SEEK_SET

def reversed_lines(self, file):
    "Generate the lines of file in reverse order."
    newline_char_set = set(['\r', '\n'])
    tail = ""
    for block in self.reversed_blocks(file):
        if block is not None and len(block)>0:
            # First split the whole block into lines and reverse the list
            reversed_lines = block.splitlines()
            reversed_lines.reverse()

            # If the last char of the block is not a newline, then the last line
            # crosses a block boundary, and the tail (possible partial line from
            # the previous block) should be added to it.
            if block[-1] not in newline_char_set:
                reversed_lines[0] = reversed_lines[0] + tail

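            # Otherwise, the block ended on a line boundary, and the tail is a
            # complete line itself. Caveat: an empty tail is skipped here, so a
            # blank line sitting right after a block boundary is not yielded.
            elif len(tail)>0:
                reversed_lines.insert(0,tail)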

            # Within the current block, we can't tell if the first line is complete
            # or not, so we extract it and save it for the next go-round with a new
            # block. We yield instead of returning so all the internal state of this
            # iteration is preserved (how many lines returned, current tail, etc.).
            tail = reversed_lines.pop()

            for reversed_line in reversed_lines:
                yield reversed_line

    # We're out of blocks now; if there's a tail left over from the last block we read,
    # it's the very first line in the file. Yield that and we're done.
    if len(tail)>0:
        yield tail

def reversed_blocks(self, file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."

    # Jump to the end of the file, and save the file offset.
    file.seek(0, os.SEEK_END)
    here = file.tell()

    # When the file offset reaches zero, we've read the whole file.
    while 0 < here:
        # Compute how far back we can step; either there's at least one
        # full block left, or we've gotten close enough to the start that
        # we'll read the whole file.
        delta = min(blocksize, here)

        # Back up to there and read the block; we yield it so that the 
        # variable containing the file offset is retained.
        file.seek(here - delta, os.SEEK_SET)
        yield file.read(delta)

        # Move the pointer back by the amount we just handed out. If we've
        # read the last block, "here" will now be zero.
        here -= delta

reversed_lines is an iterator, so you run it in a loop:

for line in self.reversed_lines(fh):
    do_something_with_the_line(line)

The comments may be overkill, but they were useful to me while I was working out how the iterators do their job.
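
To see that each iterator carries its own state, here is a small sketch that interleaves two backward reads. It uses a standalone variant of reversed_blocks (no `self`) and in-memory binary files from `io.BytesIO`; both choices are assumptions made so the example runs on its own, not part of the code above.

```python
import io
import os

def reversed_blocks(f, blocksize=4096):
    """Yield blocks of a binary file object, from the end toward the start."""
    f.seek(0, os.SEEK_END)
    here = f.tell()
    while here > 0:
        delta = min(blocksize, here)
        f.seek(here - delta, os.SEEK_SET)
        yield f.read(delta)
        # "here" lives inside this generator, so every generator
        # remembers its own position between calls to next().
        here -= delta

# Two independent generators over two in-memory "files".
first = reversed_blocks(io.BytesIO(b"abcdefgh"), blocksize=3)
second = reversed_blocks(io.BytesIO(b"12345"), blocksize=2)

# Alternate between them; neither disturbs the other's offset.
interleaved = [next(first), next(second), next(first), next(second)]
print(interleaved)  # [b'fgh', b'45', b'cde', b'23']
```

Each call to `next()` resumes its generator right after the `yield`, which is why no offset has to be stored outside the function.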

Answer 1: (score: 0)

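import os

# Open read/write in binary mode: offsets are bytes and truncate() is allowed.
with open(filename, 'r+b') as f:
    size = os.stat(filename).st_size
    # Read only the last 4096 bytes (or the whole file if it is smaller).
    f.seek(max(size - 4096, 0))
    block = f.read(4096)
    # Find amount to truncate
    f.truncate(...)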
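
One way the "find amount to truncate" step might be filled in is sketched below. It is not the answerer's code: the function name `strip_trailing_newlines` is hypothetical, it assumes the whole run of trailing blank lines fits inside the final block, and it chooses to keep the line terminator that follows the last line of content.

```python
import os

def strip_trailing_newlines(filename, blocksize=4096):
    """Truncate trailing blank lines found in the last block of the file.

    Hypothetical helper. Only CR and LF bytes are treated as blank, and
    only the final `blocksize` bytes are inspected.
    """
    with open(filename, "r+b") as f:
        size = os.stat(filename).st_size
        start = max(size - blocksize, 0)
        f.seek(start)
        block = f.read(blocksize)

        stripped = block.rstrip(b"\r\n")
        if not stripped:
            # The block is empty or all newlines; a longer backward
            # scan would be needed, so leave the file untouched.
            return size

        new_size = start + len(stripped)
        # Keep the terminator that ends the last content line, if any.
        rest = block[len(stripped):]
        if rest.startswith(b"\r\n"):
            new_size += 2
        elif rest:
            new_size += 1

        f.truncate(new_size)
        return new_size
```

Since only the last block is read and the file is shortened in place, the cost does not depend on the total file size.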