Here is the cProfile output for 5 calls:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    3.743    0.749    3.743    0.749 {posix.waitpid}
        6    0.756    0.126    0.756    0.126 {method 'readlines' of 'file' objects}
        5    0.070    0.014    0.070    0.014 {posix.read}
        5    0.058    0.012    0.058    0.012 {posix.fork}
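I need to run the whole process 5M times (possibly more later), so I need every improvement I can get.
posix.waitpid is the time spent waiting on the subprocess call (I have to wait until it finishes and its output is ready), so I probably cannot improve that part any further.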
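I need to find the index of the line that startswith('xxx') and the total number of lines in the file. Is there a faster way to get this information than open("yyy.txt"), readlines, or with open("yyy.txt") as f:?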
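The question does not show its own code, so the following is only an assumed baseline of the kind it describes: load every line with readlines, then scan the list in Python. The function name find_line_readlines, the prefix 'xxx', and the sample file contents are hypothetical and used only for illustration.

```python
import os
import tempfile


def find_line_readlines(file_name, prefix):
    """Assumed baseline: build a list of all lines, then scan it."""
    with open(file_name) as f:
        lines = f.readlines()
    found_at = -1
    # Line numbers are 1-based; stop at the first matching line.
    for number, line in enumerate(lines, 1):
        if line.startswith(prefix):
            found_at = number
            break
    return found_at, len(lines)


# Hypothetical sample data written to a temporary file for the demo.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alpha\nbeta\nxxx first\ngamma\n')
    baseline_path = tmp.name

try:
    baseline_result = find_line_readlines(baseline_path, 'xxx')
finally:
    os.remove(baseline_path)

print(baseline_result)  # (3, 4)
```

This version creates one Python string object per line, which is likely where the readlines time in the profile comes from.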
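Answer 0 (score: 1)
If the file is not too large to fit in memory, you can read the whole file at once instead of one line at a time. Then, rather than splitting the data into lines, search for what you are looking for and count the newlines before it to get the line the item is on. Get the total line count by counting all the newlines. Here is a function: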
def find_line_fast(file_name, start):
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        # Find a line that starts with the value of start.
        idx = buf.find('\n' + start)
        if idx != -1:
            # If found, count lines up to the line where it was found.
            found_at = buf[:idx + 1].count('\n') + 1
    # Return the line found at, and the total number of lines.
    return found_at, buf.count('\n')
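As a small worked example of the newline-counting idea, the sketch below repeats the function so that it runs on its own, and additionally checks the first line, which has no preceding newline to search for. The sample file contents and the prefixes 'xxx' and 'zzz' are made up for the demonstration.

```python
import os
import tempfile


def find_line_fast(file_name, start):
    """One read, then search the buffer and count newlines."""
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # A match on the first line is not preceded by '\n'.
        found_at = 1
    else:
        idx = buf.find('\n' + start)
        if idx != -1:
            # Newlines up to and including idx = lines before the match.
            found_at = buf[:idx + 1].count('\n') + 1
    return found_at, buf.count('\n')


# Hypothetical sample data written to a temporary file for the demo.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alpha\nbeta\nxxx first\ngamma\nxxx second\n')
    sample_path = tmp.name

try:
    result = find_line_fast(sample_path, 'xxx')
    missing = find_line_fast(sample_path, 'zzz')
finally:
    os.remove(sample_path)

print(result)   # (3, 5)
print(missing)  # (-1, 5)
```

Note that the total is the number of newline characters, so a file whose last line lacks a trailing newline would be reported as one line shorter than a line-by-line count.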
Here is a benchmark comparing the above with the readline and line-splitting approaches. The above is the fastest.
import datetime


def find_line_readline(file_name, start):
    # Iterate over the file one line at a time.
    count = 0
    found_at = -1
    with open(file_name) as f:
        for line in f:
            count += 1
            if found_at == -1 and line.startswith(start):
                found_at = count
    return found_at, count


def find_line_split(file_name, start):
    # Read everything, then split the buffer into a list of lines.
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    for i, line in enumerate(buf.split('\n')):
        if line.startswith(start):
            found_at = i + 1
            break
    return found_at, buf.count('\n')


def find_line_fast(file_name, start):
    # Read everything, then search the buffer and count newlines.
    with open(file_name) as f:
        buf = f.read()
    found_at = -1
    if buf.startswith(start):
        # The first line has no preceding newline, so check it separately.
        found_at = 1
    else:
        idx = buf.find('\n' + start)
        if idx != -1:
            found_at = buf[:idx + 1].count('\n') + 1
    return found_at, buf.count('\n')


n = 100
fname = "boggle_dict.txt"
st = "zymotic"
for fn in (find_line_readline, find_line_split, find_line_fast):
    # Run once to show the result, then time n repetitions.
    at, count = fn(fname, st)
    print('%s found "%s" on line: %d of %d' % (fn.__name__, st, at, count))
    start = datetime.datetime.now()
    for i in range(n):
        fn(fname, st)
    print('%d * %s took %s' % (n, fn.__name__, datetime.datetime.now() - start))
    print('')
Output
find_line_readline found "zymotic" on line: 172819 of 172823
100 * find_line_readline took 0:00:14.289262
find_line_split found "zymotic" on line: 172819 of 172823
100 * find_line_split took 0:00:12.784887
find_line_fast found "zymotic" on line: 172819 of 172823
100 * find_line_fast took 0:00:01.144335