An efficient way to get the total line count and a line index in a file

Time: 2017-11-03 23:54:53

Tags: python

Here is the cProfile output for 5 calls:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    5    3.743    0.749    3.743    0.749 {posix.waitpid}
    6    0.756    0.126    0.756    0.126 {method 'readlines' of 'file' objects}
    5    0.070    0.014    0.070    0.014 {posix.read}
    5    0.058    0.012    0.058    0.012 {posix.fork}

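I need to run the whole procedure 5M times (possibly more later), so I need every improvement I can get.

  • posix.waitpid is the time spent waiting on the subprocess call (I have to wait until it finishes and its output is ready), so I probably cannot improve it any further.

  • I need to find the index of the line for which startswith('xxx') is true, and the total number of lines in the file. Is there a faster way to get this information than open("yyy.txt").readlines() or with open("yyy.txt") as f:?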

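1 answer:

Answer 0: (score: 1)

If the file is not too large to fit in memory, you can read the whole file at once instead of one line at a time. Then, rather than splitting the data into lines, search for the text you want and count the newlines that precede it; that gives you the line the item is on. Counting all newlines gives the total line count. Here is a function: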

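def find_line_fast(file_name, start):
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        # Find a later line that starts with the value of start.
        idx = buf.find('\n' + start)
        if idx != -1:
            # The newlines up to and including idx are the lines before the match.
            found_at = buf.count('\n', 0, idx + 1) + 1
    # Return the line found at (-1 if absent) and the newline count as total lines.
    return found_at, buf.count('\n')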
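The same idea can be sketched in binary mode, which skips text decoding and makes two edge cases explicit: a match on the very first line, and a last line with no trailing newline. The name find_line_bytes, the bytes argument and the throwaway sample file are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile


def find_line_bytes(file_name, start):
    """Return (1-based line of the first line starting with `start`, total lines).

    `start` is a bytes object; the line number is -1 when nothing matches.
    """
    with open(file_name, "rb") as f:
        buf = f.read()
    # Total lines: newline count, plus one if the last line is unterminated.
    total = buf.count(b"\n")
    if buf and not buf.endswith(b"\n"):
        total += 1
    # A match on the first line has no preceding newline to search for.
    if buf.startswith(start):
        return 1, total
    idx = buf.find(b"\n" + start)
    if idx == -1:
        return -1, total
    # The newlines in buf[:idx + 1] equal the number of lines before the match.
    return buf.count(b"\n", 0, idx + 1) + 1, total


# Usage on a small throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\nxxx gamma\ndelta")
path = tmp.name
print(find_line_bytes(path, b"xxx"))    # (3, 4)
print(find_line_bytes(path, b"alpha"))  # (1, 4)
print(find_line_bytes(path, b"nope"))   # (-1, 4)
os.remove(path)
```

Passing start and end offsets to count avoids copying a slice of the buffer just to count its newlines.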

Here is a benchmark comparing find_line_fast with the line-by-line reading and line-splitting approaches. find_line_fast is the fastest.

import datetime

def find_line_readline(file_name, start):
    count = 0
    found_at = -1
    with open(file_name) as f:
        for line in f:
            count += 1
            if found_at == -1 and line.startswith(start):
                found_at = count
    return found_at, count


def find_line_split(file_name, start):
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    for i, line in enumerate(buf.split('\n')):
        if line.startswith(start):
            found_at = i+1
            break
    return found_at, buf.count('\n')


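def find_line_fast(file_name, start):
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        idx = buf.find('\n' + start)
        if idx != -1:
            found_at = buf.count('\n', 0, idx + 1) + 1
    return found_at, buf.count('\n')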


n = 100
fname = "boggle_dict.txt"
st = "zymotic"
for fn in (find_line_readline, find_line_split, find_line_fast):
    # Run once to report the result, then time n repeated calls.
    at, count = fn(fname, st)
    print(fn.__name__, 'found "%s" on line: %d of %d' % (st, at, count))
    start = datetime.datetime.now()
    for i in range(n):
        fn(fname, st)
    print(n, '*', fn.__name__, 'took', datetime.datetime.now() - start)
    print()

Output

find_line_readline found "zymotic" on line: 172819 of 172823
100 * find_line_readline took 0:00:14.289262

find_line_split found "zymotic" on line: 172819 of 172823
100 * find_line_split took 0:00:12.784887

find_line_fast found "zymotic" on line: 172819 of 172823
100 * find_line_fast took 0:00:01.144335
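To repeat a comparison like this on your own data, the standard library's timeit module is one option. The following is only a sketch: the generated sample file, the target string and the helper names count_by_iteration and count_by_find are made up for illustration, and no timings are quoted because they depend on the machine and the file.

```python
import os
import tempfile
import timeit


def count_by_iteration(file_name, start):
    # Walk the file line by line, remembering the first matching line.
    count, found_at = 0, -1
    with open(file_name) as f:
        for count, line in enumerate(f, 1):
            if found_at == -1 and line.startswith(start):
                found_at = count
    return found_at, count


def count_by_find(file_name, start):
    # Read everything at once, then search and count newlines.
    with open(file_name) as f:
        buf = f.read()
    if buf.startswith(start):
        return 1, buf.count("\n")
    idx = buf.find("\n" + start)
    found_at = buf.count("\n", 0, idx + 1) + 1 if idx != -1 else -1
    return found_at, buf.count("\n")


# Hypothetical sample data: 100000 newline-terminated lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("word%06d\n" % i for i in range(100000))
path = tmp.name

target = "word099990"
results = {}
for fn in (count_by_iteration, count_by_find):
    results[fn.__name__] = fn(path, target)
    seconds = timeit.timeit(lambda: fn(path, target), number=5)
    print("%s -> %s, 5 runs: %.3f s" % (fn.__name__, results[fn.__name__], seconds))
os.remove(path)
```

Both helpers should report the same line number and total for a file whose lines all end in a newline, so any difference in the printed times comes from how the file is scanned.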