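I have a program whose purpose is to read eight files. Each file is a million characters long, with no punctuation, just a run of characters.
The eight files represent four DNA samples that were found. The program takes the characters from one file of a sample and combines them with the characters from the other file of the same sample. For example, if file1 reads:
abcdefg
and file2 reads:
hijklmn
the combinations would be:
ah, bi, cj, dk, el, fm, gn
The program then counts how many times each pair occurs and prints out a dictionary, for example: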
{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}
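The problem is that while the program works for small files, for files millions of characters long (yes, that is a literal number, not an exaggeration) the program hangs and never seems to finish. (I let it run overnight once, with no result.)
Is it an overflow error, or does the method I am using simply not scale to large files? Is there a better way to approach this?
My code: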
import re
from collections import Counter

def ListStore(fileName):
    '''Purpose, stores the contents of file into a single string'''
    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')
    #reads the file into a string, strips out the newlines as well
    fileContents = fin.read().rstrip()
    #closes up the file
    fin.close()
    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)
    #returns the list of characters
    return fileContentsList

def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''
    #creates an empty list for list3
    list3 = []
    #combines the codes from one half with their matching from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]
    return list3

def printResult(list):
    '''stores the result of the combination in a dictionary'''
    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)
    print (result)
    return result

def main():
    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''
    #first sample files
    file_name = "a.txt"
    file_name2 = "b.txt"
    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"
    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"
    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"

    #Get the first sample ready
    #store both sides into a list of characters
    contentList = ListStore(file_name)
    contentList2 = ListStore(file_name2)
    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)
    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)
    print (SampleA)

    # ****Get the second sample ready****
    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)
    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)
    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)
    print (SampleB)

    # ****Get the third sample ready****
    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)
    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)
    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)
    print (SampleC)

    # ****Get the fourth sample ready****
    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)
    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)
    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)
    print (SampleD)

if __name__ == '__main__':
    main()
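Answer 0 (score: 2)
Don't read the whole content into memory. There is no need. Also, zip() already splits your strings into characters, so you don't need to do that yourself.
The trick here is to create a generator that pairs up the characters while reading the two files in chunks, which is the most efficient way to read a file.
Finally, use collections.Counter() to keep the counts: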
from functools import partial
from collections import Counter

# filename1 and filename2 are the paths of the two files of one sample
with open(filename1, 'r') as file1, open(filename2, 'r') as file2:
    chunked1 = iter(partial(file1.read, 1024), '')
    chunked2 = iter(partial(file2.read, 1024), '')
    counts = Counter(
        ''.join(pair)
        for chunks in zip(chunked1, chunked2)
        for pair in zip(*chunks))
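Here the code reads in chunks of 1024 bytes; adjust the size as needed for best performance. No more than 2048 bytes of file data are held in memory at any one time, and the pairs are generated on the fly as they are counted.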
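To make the snippet above easier to try, here is a minimal self-contained sketch of the same chunked approach wrapped in a function. The names count_pairs_chunked and write_temp, the chunk_size parameter, and the temporary files are assumptions for illustration and not part of the original answer. It also assumes that each read() returns a full chunk until the end of the file, so the chunks of the two files stay aligned.

```python
import os
import tempfile
from collections import Counter
from functools import partial


def count_pairs_chunked(path1, path2, chunk_size=1024):
    """Count character pairs at matching positions in two text files.

    Hypothetical helper: reads both files chunk_size characters at a time
    instead of loading them fully into memory.
    """
    with open(path1, 'r') as file1, open(path2, 'r') as file2:
        # iter(callable, sentinel) keeps calling read() until it returns ''
        chunked1 = iter(partial(file1.read, chunk_size), '')
        chunked2 = iter(partial(file2.read, chunk_size), '')
        return Counter(
            a + b
            for chunk1, chunk2 in zip(chunked1, chunked2)
            for a, b in zip(chunk1, chunk2)
        )


def write_temp(text):
    """Write text to a temporary file and return its path (demo only)."""
    handle, path = tempfile.mkstemp(suffix='.txt', text=True)
    with os.fdopen(handle, 'w') as out:
        out.write(text)
    return path


path_a = write_temp('abcabc')
path_b = write_temp('xyzxyz')
try:
    # A tiny chunk size forces several reads even on this short input.
    counts = count_pairs_chunked(path_a, path_b, chunk_size=4)
    print(dict(counts))  # {'ax': 2, 'by': 2, 'cz': 2}
finally:
    os.remove(path_a)
    os.remove(path_b)
```

Note that a trailing newline in the files would be paired like any other character, so it may need stripping depending on how the data files are written.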
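Answer 1 (score: 1)
In your printResult method, you go through each element i in list and assign the value list.count(i) to the key i in the result dictionary.
I'm not entirely sure how count(i) works internally, but I believe it searches through the list and counts the occurrences of i every time it is run. In your code, if you have duplicates, for example in ['aa','bb','aa'], you count how many 'aa' elements are in the list twice, and each time you look through the whole list. On a list of n elements that is n full scans, which is very time-consuming for long lists.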
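The repeated scanning can be made visible with a small sketch. CountingList is a hypothetical list subclass introduced only for this illustration; it records how many times count() is called by the dictionary expression from the question.

```python
class CountingList(list):
    """Hypothetical list subclass that records how often count() is called."""

    def __init__(self, *args):
        super().__init__(*args)
        self.count_calls = 0

    def count(self, value):
        self.count_calls += 1
        # list.count() itself walks the entire list on every call
        return super().count(value)


pairs = CountingList(['aa', 'bb', 'aa'])
# Same expression as in printResult from the question
result = dict((p, pairs.count(p)) for p in pairs)
print(result)             # {'aa': 2, 'bb': 1}
print(pairs.count_calls)  # 3
```

There is one full scan per element, duplicates included. With a million pairs that means a million scans of a million elements each, on the order of 10^12 comparisons, which would explain why the program appears to hang.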
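You only need to go through the list once to count the number of elements of each type. I suggest using defaultdict, since it lets every new key start with a default value of 0.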
from collections import defaultdict

# `list` is the parameter of printResult in the question
result = defaultdict(int)
for i in list:
    result[i] = result[i] + 1
print(result)
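Creating the defaultdict with int makes every new key start with the value 0. You can then go through the list once, adding 1 each time you encounter a pair. This avoids going through the list more than once.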
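As a quick sanity check, the single-pass idea can be wrapped in a function and compared with the expression from the question on a small list. The function name count_pairs and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def count_pairs(pairs):
    """Count each pair in a single pass (hypothetical helper name)."""
    result = defaultdict(int)  # missing keys start at int() == 0
    for pair in pairs:
        result[pair] += 1
    return dict(result)


sample = ['aa', 'bb', 'aa', 'cc', 'bb', 'aa']
single_pass = count_pairs(sample)
# The approach from the question, for comparison only.
repeated_count = dict((p, sample.count(p)) for p in sample)
print(single_pass)                    # {'aa': 3, 'bb': 2, 'cc': 1}
print(single_pass == repeated_count)  # True
```

Both give the same counts, but the single pass touches each element once instead of rescanning the list for every element.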
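Answer 2 (score: 1)
As written, I personally don't think your program is I/O bound. Even if it were, breaking the reading up into multiple calls, even buffered ones, would not be as fast as reading the whole thing into memory as you are doing. That said, I'm not sure why your program takes so long to process large files. It may be the many unneeded operations it performs: strings and lists are both sequences, so there is usually no need to convert from one to the other.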
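The point about sequences can be seen in a minimal sketch using the example strings from the question. The variable names are assumptions for illustration.

```python
first_half = 'abcdefg'   # contents of file1 in the question's example
second_half = 'hijklmn'  # contents of file2 in the question's example

# zip() iterates over the strings character by character,
# so no list() conversion is needed beforehand.
pairs = [a + b for a, b in zip(first_half, second_half)]
print(pairs)  # ['ah', 'bi', 'cj', 'dk', 'el', 'fm', 'gn']
```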
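Here is an optimized version of your program with most of the redundant and/or unnecessary parts removed. It actually makes use of the collections.Counter class, which your code imported but never used. Although it still reads the file contents into memory, it keeps them only for the minimum time needed to process each pair of files.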
from collections import Counter
import os

DATA_FOLDER = 'datafiles' # folder path to data files ('' for current dir)

def ListStore(fileName):
    '''return contents of file as a single string with any newlines removed'''
    with open(os.path.join(DATA_FOLDER, fileName), 'r') as fin:
        return fin.read().replace('\n', '')

def ListCombo(seq1, seq2):
    '''combine the two DNA sequences into one'''
    # combines the codes from one half with their matching from the other
    return [''.join(pair) for pair in zip(seq1, seq2)]

def CountPairs(seq):
    '''counts occurrences of pairs in the list of the combinations and stores
    them in a Counter dict instance keyed by letter-pairs'''
    return Counter(seq)

def PrintPairs(counter):
    # print the results in the counter dictionary (in sorted order)
    print('{' + ', '.join(('{}: {}'.format(pair, count)
                           for pair, count in sorted(counter.items()))) + '}')

def ProcessSamples(file_name1, file_name2):
    # read both sides as strings of characters
    contentList1 = ListStore(file_name1)
    contentList2 = ListStore(file_name2)
    # combine the two sequences together
    combo_list = ListCombo(contentList1, contentList2)
    # count the sample results and store into a dictionary
    counter = CountPairs(combo_list)
    # print the results
    PrintPairs(counter)

def main():
    '''reads the contents of N pairs of files, and finds out how many
    combinations exist in each'''
    file_names = ('a.txt', 'b.txt',
                  'c.txt', 'd.txt',
                  'e.txt', 'f.txt',
                  'g.txt', 'h.txt',)
    # group the file names two at a time: (a, b), (c, d), ...
    for (file_name1, file_name2) in zip(*([iter(file_names)]*2)):
        ProcessSamples(file_name1, file_name2)

if __name__ == '__main__':
    main()
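The zip(*([iter(file_names)]*2)) expression in main() may look cryptic, so here is a small sketch of what it does in isolation. The shortened file_names tuple and the variable names shared, grouped, and one_liner are assumptions for illustration.

```python
file_names = ('a.txt', 'b.txt', 'c.txt', 'd.txt')

# [iter(file_names)] * 2 is a list holding the SAME iterator twice, so
# zip() pulls two consecutive items from it for every tuple it builds.
shared = iter(file_names)
grouped = list(zip(shared, shared))
print(grouped)  # [('a.txt', 'b.txt'), ('c.txt', 'd.txt')]

# The one-line form used in main() produces the same pairs.
one_liner = list(zip(*([iter(file_names)] * 2)))
print(one_liner == grouped)  # True
```

If the number of file names were odd, zip() would silently drop the last one, so this idiom assumes the files always come in pairs.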