Question

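I am using Python 3.2 to parse log files of 1 to 10 GB. I need to search for lines matching a specific regex (some kind of timestamp), and I want to find the last occurrence.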

I have tried:

for line in reversed(list(open("filename"))):

which resulted in very poor performance in the good cases and a MemoryError in the bad ones.

I did not find any good answer in the thread "Read a file in reverse order using python".

I found the following solution, "python head, tail and backward read by lines of a text file", which looks very promising, but it does not work on Python 3.2 and fails with the error:

NameError: name 'file' is not defined

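I then tried replacing File(file) with File(TextIOWrapper), since that is the type of object the built-in function open() returns, but that resulted in several more errors (I can elaborate if someone suggests this is the right way :))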

Answer 1

If you don't want to read the whole file, you can always use seek. Here is a demonstration:

 $ cat words.txt 
foo
bar
baz
[6] oz123b@debian:~ $ ls -l words.txt 
-rw-r--r-- 1 oz123 oz123 12 Mar  9 19:38 words.txt

The file is 12 bytes long. You can jump to the last entry by moving the cursor forward 8 bytes:

In [3]: w=open("words.txt")
In [4]: w.seek(8)
In [5]: w.readline()
Out[5]: 'baz\n'

To complete my answer, here is how to print the lines in reverse order:

 w=open('words.txt')

In [6]: for s in [8, 4, 0]:
   ...:     _= w.seek(s)
   ...:     print(w.readline().strip())
   ...:     
baz
bar
foo

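You have to know the data structure of the file and the size of each line. Mine is simple because it is only meant to demonstrate the principle.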
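When the line lengths are not known in advance, offsets like the [8, 4, 0] used above could be collected in one forward pass over the file. The following is a minimal sketch and not part of the original answer: `line_offsets` and `lines_reversed` are hypothetical helper names, UTF-8 is an assumed encoding, and it assumes the list of offsets (one integer per line) fits in memory even though the file itself does not.

```python
def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:   # binary mode: offsets are exact byte positions
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def lines_reversed(path, encoding="utf-8"):
    """Yield the lines of the file from last to first, using seek()."""
    offsets = line_offsets(path)
    with open(path, "rb") as f:
        for start in reversed(offsets):
            f.seek(start)
            # readline() reads from the offset up to and including the newline
            yield f.readline().decode(encoding).rstrip("\r\n")
```

This reads the file twice (once to index it, once backwards), so it trades extra I/O for a small memory footprint.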

Answer 2

Here is a function that does what you are looking for

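def reverse_lines(filename, BUFSIZE=4096):
    # Binary mode lets seek() use arbitrary byte offsets. Lines are therefore
    # yielded as bytes; call .decode() on them if you need str.
    with open(filename, "rb") as f:
        f.seek(0, 2)          # jump to the end of the file
        p = f.tell()          # current position == file size
        remainder = b""       # partial line carried over to the previous chunk
        while True:
            sz = min(BUFSIZE, p)
            p -= sz
            f.seek(p)
            buf = f.read(sz) + remainder
            if b"\n" not in buf:
                # no complete line yet, keep accumulating
                remainder = buf
            else:
                i = buf.index(b"\n")
                # everything after the first newline consists of complete lines
                for L in buf[i+1:].split(b"\n")[::-1]:
                    yield L
                remainder = buf[:i]
            if p == 0:
                break
        # whatever is left is the first line of the file
        yield remainder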

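It works by reading a buffer (4 KB by default) from the end of the file and yielding all the lines in it in reverse order. It then moves back by another 4 KB and does the same, until it reaches the beginning of the file. The code may need to keep more than 4 KB in memory if the chunk being processed contains no line break (a very long line).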

You can use the code as

for L in reverse_lines("my_big_file"):
   ... process L ...
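Applied to the original question (finding the last line that matches a timestamp pattern), the same idea of reading chunks from the end can stop as soon as a match is found. This is an illustrative sketch rather than the answerer's code: the name `find_last_match`, its parameters, and the UTF-8 encoding are assumptions made for the example.

```python
import re


def find_last_match(path, pattern, bufsize=4096, encoding="utf-8"):
    """Return the last line matching `pattern`, or None if no line matches.

    The file is read backwards in chunks of `bufsize` bytes, so only a small
    part of it is held in memory (unless a single line is very long).
    """
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        f.seek(0, 2)                # move to the end of the file
        pos = f.tell()
        tail = b""                  # incomplete line carried to the next chunk
        while pos > 0:
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos)
            buf = f.read(size) + tail
            lines = buf.split(b"\n")
            # The first piece may be cut off mid-line; keep it for later.
            tail = lines[0]
            for raw in reversed(lines[1:]):
                line = raw.decode(encoding)
                if regex.search(line):
                    return line     # stop early: this is the last match
        # What is left over is the first line of the file.
        line = tail.decode(encoding)
        if regex.search(line):
            return line
    return None
```

A call might look like `find_last_match("my_big_file", r"\d{4}-\d{2}-\d{2}")`, where the pattern is only a placeholder for whatever timestamp format the log uses. Splitting on the newline byte before decoding is safe for UTF-8, because a multi-byte character never contains that byte.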

How to read a file in reverse order in Python 3.2 without reading the whole file into memory?
