I have a file containing a list of names and their positions (start - end).
My script iterates over that file and, for each name, reads a second data file to check whether each line falls between those positions, then computes something from the matching lines.
At the moment it reads the entire second file (60 MB) line by line, checking whether each line lies between start/end, once for every name in the first list (about 5000 names). What is the fastest way to collect the data between those positions instead of re-reading the whole file 5000 times?
Example code for the second loop:
for line in file:
    value = int(line.split()[2])  # split once instead of twice per line
    if start <= value <= end:
        do_something_with(line)
Edit: loading the file into a list above the first loop and iterating over that improved the speed.
with open("filename.txt", 'r') as f:
file2 = f.readlines()
for line in file:
[...]
for line2 in file2:
[...]
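For reference, a complete version of that restructuring might look like the sketch below; the filenames and the column holding the position are assumptions based on the snippets above, not from the original post.

# Hedged sketch: read the 60 MB data file into memory once, then scan the
# in-memory list for every (name, start, end) entry. Filenames and column
# layout are hypothetical.
with open("positions.txt") as f:      # hypothetical name/start/end file
    names = [line.split() for line in f]

with open("data.txt") as f:           # hypothetical 60 MB data file
    data_lines = f.readlines()        # read once, reuse for every name

for name, start, end in names:
    start, end = int(start), int(end)
    for line in data_lines:
        # position assumed to be the third whitespace-separated field
        if start <= int(line.split()[2]) <= end:
            pass  # do something with the matching line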
Answer 0 (score: 1)
You can use the mmap module to load the file into memory and then iterate over it.
Example:
import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!\n"
    # close the map
    mm.close()
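Applied to the question's setup, a minimal sketch might map the data file once and rewind it for each range; the filename, the example ranges, and the column index are assumptions, not from the post.

import mmap

# Map the 60 MB file once, then rescan the in-memory mapping for each
# (start, end) range instead of re-opening the file every time.
with open("data.txt", "rb") as f:     # hypothetical data file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for start, end in [(10, 45), (2, 500)]:  # hypothetical ranges
        mm.seek(0)  # rewind before each pass over the file
        for line in iter(mm.readline, b""):
            # position assumed to be the third whitespace-separated field
            if start <= int(line.split()[2]) <= end:
                pass  # process the matching line here
    mm.close()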
Answer 1 (score: 0)
Maybe swap your loops? Iterate over the file in the outer loop, and over the list of names in the inner loop.
name_and_positions = [
    ("name_a", 10, 45),
    ("name_b", 2, 500),
    ("name_c", 96, 243),
]

with open("somefile.txt") as f:
    for line in f:
        value = int(line.split()[2])
        for name, start, end in name_and_positions:
            if start <= value <= end:
                print("matched {} with {}".format(name, value))
Answer 2 (score: 0)
As I see it, your problem is not re-reading the file, but matching slices of a long list against a short list. As the other answers point out, you can use a plain list or a memory-mapped file to speed up your program.
If you want to squeeze out more speed with a specific data structure, I suggest you look into blist, in particular because it has better performance at slicing lists than the standard Python list: they claim O(log n) instead of O(n).
I measured a speedup of almost 4x on a list of about 10 MB:
import random
from blist import blist

LINE_NUMBER = 1000000

def write_files(line_length=LINE_NUMBER):
    with open('haystack.txt', 'w') as infile:
        for _ in range(line_length):
            infile.write('an example\n')

    with open('needles.txt', 'w') as infile:
        for _ in range(line_length // 100):  # integer division for range()
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            infile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))

def read_files():
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())
    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list

def match_over(list_structure):
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1
    return float(matches) / float(total)
As measured with IPython's %time command, blist takes 12 seconds where the plain list takes 46 seconds:
In [1]: import main
In [3]: main.write_files()
In [4]: !ls -lh *.txt
10M haystack.txt
233K needles.txt
In [5]: normal_list, enhanced_list = main.read_files()
In [8]: %time main.match_over(normal_list)
CPU times: user 44.9 s, sys: 1.47 s, total: 46.4 s
Wall time: 46.4 s
Out[8]: 0.005032
In [9]: %time main.match_over(enhanced_list)
CPU times: user 12.6 s, sys: 33.7 ms, total: 12.6 s
Wall time: 12.6 s
Out[9]: 0.005032