Question

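I have large log files (500,000 lines) that I parse for specific sections. When a section is found, it is printed to a text widget. Even when I cut the lines read down to the last 50,000, it takes a minute or more to finish.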

with open(i, "r") as f:
    r = f.readlines()
    r = r[-50000:]
    start = 0
    for line in r:
        if 'Start section' in line:
            if start == 1:
                cpfotxt.insert('end', line + "\n", 'hidden')
            start = 1
        if 'End section' in line:
            start = 0
            cpfotxt.insert('end', line + "\n")
        if start == 1:
            cpfotxt.insert('end', line + "\n")
# No f.close() needed: the with block already closes the file.

Is there any way to make this faster?

Answer 1

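You should try reading it in chunks rather than loading the whole file at once. Iterating over the file object reads one line at a time: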

with open(...) as f:
    for line in f:
        do_something(line)  # placeholder for your per-line processing

A more explicit approach that may work for you:

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

with open('bigFile') as f:
    for chunk in readInChunks(f):
        do_something(chunk)
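
Applied to the section extraction from the question, the line-by-line idea might look roughly like the sketch below. The names `tail` and `extract_sections` are made up for illustration, and the marker strings are taken from the question. It is simplified: the 'hidden' tag handling is left out, and an end marker is kept only when a section is open.

```python
from collections import deque


def tail(path, n=50000):
    """Return the last n lines of the file, reading it line by line."""
    with open(path, "r") as f:
        # A deque with maxlen discards the oldest line once n are held,
        # so the whole file is never kept in memory at once.
        return list(deque(f, maxlen=n))


def extract_sections(lines, start_marker="Start section", end_marker="End section"):
    """Return the lines from each start marker through its end marker."""
    kept = []
    inside = False
    for line in lines:
        if start_marker in line:
            inside = True
        if inside:
            kept.append(line)
        if end_marker in line:
            inside = False
    return kept
```

The result could then go to the widget in one call, for example `cpfotxt.insert('end', ''.join(extract_sections(tail(i))))`, which may also help if the per-line inserts are part of the slowdown.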

Answer 2

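Another possibility is to use seek to skip most of the lines. However, this requires you to have some idea of the size of the last 50K lines. Instead of reading all the earlier lines, jump close to the end: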

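with ... as f:
    # whence=2 (os.SEEK_END) makes the offset relative to the end of the file;
    # 80 is a guess at the average line length in bytes. In Python 3, open the
    # file in binary mode ('rb'), since text mode rejects end-relative seeks.
    f.seek(-50000 * 80, 2)
    # insert your processing here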
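
A fuller version of the seek idea could be sketched as follows. The function name `read_tail_lines` and its parameters are hypothetical, and the 80 bytes per line is just the estimate used above, so the result is approximate. The sketch opens the file in binary mode, avoids seeking before the start of a small file, and drops the first line when the seek probably landed in the middle of it.

```python
import os


def read_tail_lines(path, n=50000, avg_line_bytes=80, encoding="utf-8"):
    """Return roughly the last n lines by seeking near the end of the file."""
    with open(path, "rb") as f:  # binary mode allows arbitrary seeks
        size = f.seek(0, os.SEEK_END)  # seek() returns the new position
        # Never seek before the start of the file.
        offset = min(size, n * avg_line_bytes)
        f.seek(size - offset)
        raw_lines = f.read().splitlines(keepends=True)
    if offset < size and raw_lines:
        # We probably landed mid-line, so drop the partial first line.
        raw_lines = raw_lines[1:]
    # Decode after splitting; errors="replace" tolerates stray bytes.
    return [b.decode(encoding, errors="replace") for b in raw_lines[-n:]]
```

If the lines are longer than estimated, fewer than n lines come back, so it may be worth overestimating `avg_line_bytes`.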

How to make log parsing of a large text file faster
