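I want to remove trailing blank lines from a file, if there are any. Currently I do this by reading the file into memory, removing the blank lines there, and overwriting the file. But the file is large (30,000+ lines, and the lines are long), so this takes 2-3 seconds.
So I want to read the file line by line, but backwards, until I reach the first non-blank line. That is, I start with the last line, then the second to last, and so on. Then I would truncate the file instead of overwriting it.
What is the best way to read it in reverse? Right now I am thinking of reading 64k blocks, then looping through each block in reverse, char by char, until I have a line; when I run out of the 64k, I read another 64k and prepend it, and so on.
I assume there is no standard function or library for reading in reverse order?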
Answer 0 (score: 2)
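This is a modified version of some code I found elsewhere (probably on StackOverflow, in fact); I have extracted the two key methods that handle reading backwards.
The reversed_blocks iterator reads the file backwards in blocks of whatever size you like. The reversed_lines iterator splits each block into lines and holds back the first one, since it may be incomplete. If the next block (the one before it in the file) ends with a newline, the held-back piece is returned as a complete line; if it does not, the held-back piece is appended to the last line of the new block, completing the line that was split across the block boundary.
All the state is maintained by Python's iterator mechanism, so we don't have to store state globally anywhere; this also means you can read several files backwards at once if needed, because the state is tied to each iterator.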
import os

def reversed_lines(self, file):
    "Generate the lines of file in reverse order."
    # The file must be opened in binary mode ('rb'): the blocks are bytes,
    # and byte offsets are only reliable for seeking in binary mode.
    newlines = (b'\r', b'\n')
    tail = b""
    for block in self.reversed_blocks(file):
        if block:
            # First split the whole block into lines and reverse the list
            reversed_lines = block.splitlines()
            reversed_lines.reverse()
            # If the last char of the block is not a newline, then the last line
            # crosses a block boundary, and the tail (possible partial line from
            # the previous block) should be added to it.
            if not block.endswith(newlines):
                reversed_lines[0] = reversed_lines[0] + tail
            # Otherwise, the block ended on a line boundary, and the tail is a
            # complete line itself. (An empty tail is skipped here, so a blank
            # line that falls exactly at the start of a block is not yielded.)
            elif len(tail) > 0:
                reversed_lines.insert(0, tail)
            # Within the current block, we can't tell if the first line is complete
            # or not, so we extract it and save it for the next go-round with a new
            # block. We yield instead of returning so all the internal state of this
            # iteration is preserved (how many lines returned, current tail, etc.).
            tail = reversed_lines.pop()
            for reversed_line in reversed_lines:
                yield reversed_line
    # We're out of blocks now; if there's a tail left over from the last block we read,
    # it's the very first line in the file. Yield that and we're done.
    if len(tail) > 0:
        yield tail

def reversed_blocks(self, file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."
    # Jump to the end of the file, and save the file offset.
    file.seek(0, os.SEEK_END)
    here = file.tell()
    # When the file offset reaches zero, we've read the whole file.
    while 0 < here:
        # Compute how far back we can step; either there's at least one
        # full block left, or we've gotten close enough to the start that
        # we'll read the whole file.
        delta = min(blocksize, here)
        # Back up to there and read the block; we yield it so that the
        # variable containing the file offset is retained.
        file.seek(here - delta, os.SEEK_SET)
        yield file.read(delta)
        # Move the pointer back by the amount we just handed out. If we've
        # read the last block, "here" will now be zero.
        here -= delta
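reversed_lines is an iterator, so you run it in a loop:
for line in self.reversed_lines(fh):
    do_something_with_the_line(line)
The comments may be excessive, but they were useful to me while I was working out how the iterators do their job.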
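As a small illustration of the point about iterator state, here is a simplified, standalone variant of the block reader run over two in-memory files at once. Writing it as a module-level function instead of a method is a simplification made for this sketch, and the io.BytesIO objects and their contents are made up for the demonstration.

```python
import io
import os

def reversed_blocks(file, blocksize=4096):
    """Yield blocks of a binary file, from the end towards the start."""
    file.seek(0, os.SEEK_END)
    here = file.tell()
    while here > 0:
        delta = min(blocksize, here)
        here -= delta
        file.seek(here)
        yield file.read(delta)

# Two hypothetical in-memory "files", read backwards in lockstep.
# Each generator object remembers its own offset, so they do not interfere.
a = io.BytesIO(b"0123456789")
b = io.BytesIO(b"abcdef")
pairs = list(zip(reversed_blocks(a, 4), reversed_blocks(b, 4)))
print(pairs)  # [(b'6789', b'cdef'), (b'2345', b'ab')]
```

zip stops as soon as the shorter file runs out, which is why the first two bytes of the longer one never appear in the result.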
Answer 1 (score: 0)
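import os

# Open for reading and writing in binary mode; a read-only file cannot be truncated.
with open(filename, 'r+b') as f:
    size = os.stat(filename).st_size
    # Read only the last 4096 bytes (or the whole file if it is smaller)
    f.seek(max(size - 4096, 0))
    block = f.read(4096)
    # Find amount to truncate
    f.truncate(...)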
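One possible way to work out the truncation point, as a sketch only and not necessarily what this answer had in mind: read blocks from the end until one contains a non-whitespace byte, keep the rest of that line, and truncate after it. The function name strip_trailing_blank_lines is hypothetical, the 64k default block size is borrowed from the question, whitespace-only lines are treated as blank, and line endings are assumed to be "\n" or "\r\n".

```python
import os

def strip_trailing_blank_lines(path, blocksize=64 * 1024):
    """Truncate trailing blank lines from the file at `path`.

    Returns the number of bytes removed.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = pos = f.tell()
        keep = 0  # new file length; stays 0 if the file is all whitespace
        while pos > 0:
            delta = min(blocksize, pos)
            pos -= delta
            f.seek(pos)
            # bytes.rstrip() drops trailing ASCII whitespace, including \r and \n.
            stripped = f.read(delta).rstrip()
            if stripped:
                # Offset just past the last non-whitespace byte in the file.
                end = pos + len(stripped)
                f.seek(end)
                # Keep the remainder of that line, including its own newline.
                keep = end + len(f.readline())
                break
        if keep < size:
            f.truncate(keep)
        return size - keep
```

Only the whitespace-only tail of the file is read, so the cost depends on how many blank lines there are, not on the total file size.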