I need to scan a large file (2.2 GB) and essentially extract the first two words from each line.
Here is the code I wrote:
def generate_N_Lines(fileObj, N):
    """
    A generator: given a file object and an integer N, yield lists of up to N lines at a time.
    """
    while True:
        # readline() returns '' at end of file, so drop those entries.
        lines = [line for line in (fileObj.readline() for _ in range(N)) if line]
        if not lines:
            return
        yield lines
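def process_pileup(lines, res):
    # Nothing to do for an empty batch.
    if not lines:
        return
    for line in lines:
        # Skip blank lines and the empty strings readline() returns at end of file.
        if not line.strip():
            continue
        data = line.split('\t')
        res.append([data[0], data[1]])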
def scanPileup(pileup_path):
    res = []
    with open(pileup_path) as f:
        for lines in generate_N_Lines(f, 4):
            process_pileup(lines, res)
    print(res)
if __name__ == '__main__':
    scanPileup('path/to/file')  # placeholder path
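The problem is that this takes a very long time to run. On my last attempt it ran for more than 5 hours before I gave up and stopped it. I can't think of a way to reduce the run time. How can I optimize the code?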
File type: txt
File format: Pileup
Example line: seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
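
To make the goal concrete, here is a minimal sketch of pulling the first two fields out of a line like the example above. It assumes the fields are separated by whitespace (tabs or spaces). The names `first_two_fields` and `iter_first_two` are hypothetical and not part of the code above. It is one possible way to express the task, not a measured fix for the run time.

```python
def first_two_fields(line):
    """Return the first two whitespace-separated fields of a line, or None."""
    # maxsplit=2 stops after two splits, so the long tail of the line is not split further.
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None  # blank or malformed line
    return parts[0], parts[1]


def iter_first_two(path):
    """Yield (field1, field2) for each usable line, reading the file lazily."""
    with open(path) as f:
        # Iterating over a file object reads one line at a time.
        for line in f:
            fields = first_two_fields(line)
            if fields is not None:
                yield fields


example = "seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&"
print(first_two_fields(example))  # ('seq1', '272')
```

Because `iter_first_two` is a generator, the caller can consume the pairs one at a time (for example, write them to another file) instead of collecting everything in one list and printing it at the end.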