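I am trying to open specific lines from multiple files and return those lines for each file. My solution takes a lot of time. Do you have any suggestions?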
func.filename: the name of the given file
func.start_line: the starting line in the given file
func.end_line: the ending line in the given file
    from sys import stderr

    def method_open(func):
        try:
            # Read every line, then keep only the requested range.
            with open(func.filename) as f:
                body = f.readlines()[func.start_line:func.end_line]
        except IOError:
            body = []
            stderr.write("\nCouldn't open the referenced method inside {0}".
                         format(func.filename))
            stderr.flush()
        return body
Keep in mind that the file being opened, func.filename, is sometimes the same one as before, but unfortunately most of the time it is not.
Answer 0: (score: 2)
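The problem with readlines is that it reads the entire file into memory, and linecache does the same.
You can save some time by reading one line at a time and breaking out of the loop as soon as you reach func.end_line.
But the best approach I have found is to use itertools.islice.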
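As a rough illustration, the function from the question could be adapted to islice along these lines. This is only a sketch: `read_line_range` is a hypothetical name, it takes the three values directly instead of the `func` object, and it assumes 0-based line numbers with an exclusive end, like the slice in the question.

```python
import itertools
import sys


def read_line_range(filename, start_line, end_line):
    """Return lines [start_line, end_line) of filename (0-based, like a slice)."""
    try:
        with open(filename) as f:
            # islice stops pulling lines from the file once end_line is
            # reached, so the remainder of the file is never read.
            return list(itertools.islice(f, start_line, end_line))
    except OSError:  # IOError is an alias of OSError in Python 3
        sys.stderr.write(
            "\nCouldn't open the referenced method inside {0}".format(filename))
        sys.stderr.flush()
        return []
```

The lines before `start_line` still have to be read and skipped, so the saving comes from not reading anything after `end_line` and not keeping the whole file in memory.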
Here are the results of some tests I ran on a 130MB file with ~9701k lines:
--- 1.43700003624 seconds --- f_readlines
--- 1.00099992752 seconds --- f_enumerate
--- 1.1400001049 seconds --- f_linecache
--- 0.0 seconds --- f_itertools_islice
Here is the script I used:
    import time
    import linecache
    import itertools

    def f_readlines(filename, start_line, endline):
        # Reads the whole file into memory, then slices.
        with open(filename) as f:
            return f.readlines()[start_line:endline]

    def f_enumerate(filename, start_line, endline):
        # Reads line by line and stops once endline is reached.
        result = []
        with open(filename) as f:
            for i, line in enumerate(f):
                if i >= endline:
                    break
                if i >= start_line:
                    result.append(line)
        return result

    def f_linecache(filename, start_line, endline):
        # linecache numbers lines from 1, hence n + 1.
        result = []
        for n in range(start_line, endline):
            result.append(linecache.getline(filename, n + 1))
        return result

    def f_itertools_islice(filename, start_line, endline):
        # islice stops reading as soon as endline is reached.
        with open(filename) as f:
            return list(itertools.islice(f, start_line, endline))

    def runtest(func_to_test):
        filename = "testlongfile.txt"
        start_line = 5000
        endline = 10000
        start_time = time.time()
        func_to_test(filename, start_line, endline)
        print("--- %s seconds --- %s" % ((time.time() - start_time), func_to_test.__name__))

    runtest(f_readlines)
    runtest(f_enumerate)
    runtest(f_linecache)
    runtest(f_itertools_islice)
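A side note on the "0.0 seconds" figure: it most likely means the call finished in less time than time.time() could resolve on that machine, not that it took no time at all. A possible way to get a finer reading is time.perf_counter with a few repetitions. The helper below is only a sketch; `time_call` and its `repeat` parameter are hypothetical names, not part of the script above.

```python
import time


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        # perf_counter is a high-resolution clock meant for short intervals.
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, time_call(f_itertools_islice, "testlongfile.txt", 5000, 10000) would time the islice variant. Repeated runs read from the operating system's file cache, so the numbers are best used for comparing the variants against each other, not as cold-start timings.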