Python 3: Is the method I use to count the combinations too slow?

Time: 2013-07-30 13:21:28

Tags: python python-3.x io

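I have a program whose purpose is to read eight files, each up to a million characters long, with no punctuation, just a run of characters.

The eight files represent four DNA samples that were found. The program takes the characters from one file of a sample and combines them with the characters from the other file of the same sample. For example, if file1 reads: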

abcdefg

and file2 reads:

hijklmn

the combination would be:

ah, bi, cj, dk, el, fm, gn

Anyway, the program then counts how many times each pair occurs and prints out a dictionary, for example:

{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}

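The problem is that while the program works for small files, for files that are millions of characters long (yes, that is a literal number, not an exaggeration) it hangs and never seems to finish the task. (I let it run overnight once, and nothing came of it.)

Is it an overflow error, or is the method I'm using too slow for large files? Is there a better way to approach this?

My code: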

import re
from collections import Counter

def ListStore(fileName):
    '''Purpose, stores the contents of file into a single string'''           

    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')

    #reads the file into a string, strips out the newlines as well
    fileContents = fin.read().rstrip()


    #closes up the file
    fin.close()

    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)   

    #returns the list of characters
    return fileContentsList


def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''


    #creates an empty list for list3
    list3 = []

    #combines the codes from one half with their matching from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]

    return list3


def printResult(list):
    '''stores the result of the combination in a dictionary'''




    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)

    print (result)
    return result


def main():

    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''


    #first sample files

    file_name = "a.txt"
    file_name2 = "b.txt"

    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"

    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"

    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"


    #Get the first sample ready

    #store both sides into a list of characters

    contentList = ListStore(file_name)

    contentList2 = ListStore(file_name2)

    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)

    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)

    print (SampleA)

    # ****Get the second sample ready****

    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)

    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)

    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)

    print (SampleB)

    # ****Get the third sample ready****

    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)

    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)

    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)

    print (SampleC)

    # ****Get the fourth sample ready****

    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)

    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)

    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)

    print (SampleD)



if __name__ == '__main__':
    main()

3 Answers:

Answer 0 (score: 2)

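Don't read the whole thing into memory. There is no need. Also, zip() already splits your strings into characters, so you don't have to do that yourself.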
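As a small illustration of the point about zip(), here is a sketch using the two example strings from the question. The variable names are made up for the example and do not come from the original code.

```python
# zip() iterates over strings character by character, so no list() call is needed.
file1_text = 'abcdefg'  # stand-in for the contents of the first file
file2_text = 'hijklmn'  # stand-in for the contents of the second file

pairs = [a + b for a, b in zip(file1_text, file2_text)]
print(pairs)  # ['ah', 'bi', 'cj', 'dk', 'el', 'fm', 'gn']
```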

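The trick here is to create a generator that pairs up the characters while reading the two files in chunks, which will be the most efficient way to read the files.

Finally, use collections.Counter() to keep count: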

from functools import partial
from collections import Counter

with open(filename1, 'r') as file1, open(filename2, 'r') as file2:
    chunked1 = iter(partial(file1.read, 1024), '')
    chunked2 = iter(partial(file2.read, 1024), '')
    counts = Counter(''.join(pair) for chunks in zip(chunked1, chunked2) for pair in zip(*chunks))

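The code here reads the files in chunks of 1024 characters (1024 bytes for plain ASCII data such as this); adjust the size as needed for best performance. No more than about 2048 characters of file content are held in memory at any one time, and the pairs are generated on the fly as they are counted.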
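The snippet above leaves filename1 and filename2 undefined. A self-contained sketch of the same idea might look like the following; the function name count_pairs_chunked and the chunk_size parameter are hypothetical, and the whitespace filter is an added assumption so that a trailing newline in each file is not counted as a pair.

```python
from collections import Counter
from functools import partial


def count_pairs_chunked(path1, path2, chunk_size=1024):
    """Count character pairs from two files without loading either one whole."""
    with open(path1, 'r') as file1, open(path2, 'r') as file2:
        # iter(callable, sentinel) calls read(chunk_size) until it returns ''
        chunks1 = iter(partial(file1.read, chunk_size), '')
        chunks2 = iter(partial(file2.read, chunk_size), '')
        return Counter(
            a + b
            for chunk1, chunk2 in zip(chunks1, chunks2)
            for a, b in zip(chunk1, chunk2)
            # assumption: skip newlines and other whitespace
            if not (a.isspace() or b.isspace())
        )
```

Calling count_pairs_chunked('a.txt', 'b.txt') would then return a Counter for one sample. Note that zip() stops at the shorter input, so extra characters in the longer file are ignored.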

答案 1 :(得分:1)

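In the printResult method, you go through each element i in list and assign the value list.count(i) to the key i in the result dictionary.

I'm not entirely sure how count(i) works, but I believe it involves searching through most of the list and counting the number of elements equal to i every time it is run. In your code, if you have duplicates, for example in ['aa','bb','aa'], you will count how many 'aa' elements are in the list twice, and each time you will go through the whole list. This is very time-consuming on a long list.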

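You only need to go through the list once to count the number of elements of each type. I suggest using a defaultdict for this, because it lets every new key start with a default value of 0:

from collections import defaultdict

result = defaultdict(int)
for i in list:
    result[i] = result[i] + 1
print(result)

Creating the defaultdict with int lets each new key start with the value 0. You can then iterate over the list once, adding 1 to the value for each pair every time you encounter it. This avoids going through the list more than once.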
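To make the cost difference concrete, here is a rough sketch that puts the two counting strategies side by side. The function names are invented for this example. The first mirrors the question's printResult and performs one full scan of the list per element, so a list of a million pairs needs on the order of 10^12 comparisons; the second touches each element exactly once.

```python
from collections import Counter, defaultdict


def count_quadratic(pairs):
    # Mirrors the question's printResult: pairs.count(p) scans the whole list,
    # and it is called once for every element, duplicates included.
    return dict((p, pairs.count(p)) for p in pairs)


def count_single_pass(pairs):
    # One pass over the list; each new key starts at int() == 0.
    result = defaultdict(int)
    for p in pairs:
        result[p] += 1
    return dict(result)


pairs = ['aa', 'bb', 'aa']
# Both strategies agree with collections.Counter; only the running time differs.
assert count_quadratic(pairs) == count_single_pass(pairs) == dict(Counter(pairs))
```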

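Answer 2 (score: 1)

As written, I personally don't think your program is I/O-bound, and even if it were, breaking the reads up into multiple calls, even buffered ones, would not be as fast as reading the whole thing into memory the way you are doing. That said, I'm not sure why your program takes so long to process large files. It may be the many unneeded operations it performs: strings and lists are both sequences, so there is usually no need to convert from one to the other.

Here is an optimized version of your program with most of the redundant and/or unnecessary parts removed. It actually makes use of the collections.Counter class, which your code imported but never used, and even though it still reads the file contents into memory, it only holds on to them for the minimum time needed to process each pair of files.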

from collections import Counter
import os

DATA_FOLDER = 'datafiles' # folder path to data files ('' for current dir)

def ListStore(fileName):
    '''return contents of file as a single string with any newlines removed'''
    with open(os.path.join(DATA_FOLDER, fileName), 'r') as fin:
        return fin.read().replace('\n', '')

def ListCombo(seq1, seq2):
    '''combine the two DNA sequences into one'''
    # combines the codes from one half with their matching from the other
    return [''.join(pair) for pair in zip(seq1, seq2)]

def CountPairs(seq):
    '''counts occurrences of pairs in the list of the combinations and stores
    them in a Counter dict instance keyed by letter-pairs'''
    return Counter(seq)

def PrintPairs(counter):
    #print the results in the counter dictionary (in sorted order)
    print('{' + ', '.join(('{}: {}'.format(pair, count)
        for pair, count in sorted(counter.items()))) + '}')

def ProcessSamples(file_name1, file_name2):
    # store both sides into a list of characters
    contentList1 = ListStore(file_name1)
    contentList2 = ListStore(file_name2)

    # combine the two lists together
    combo_list = ListCombo(contentList1, contentList2)

    # count the sample results and store into a dictionary
    counter = CountPairs(combo_list)

    #print the results
    PrintPairs(counter)

def main():
    '''reads the contents of N pairs of files, and finds out how many
    combinations exist in each'''
    file_names = ('a.txt', 'b.txt',
                  'c.txt', 'd.txt',
                  'e.txt', 'f.txt',
                  'g.txt', 'h.txt',)

    for (file_name1, file_name2) in zip(*([iter(file_names)]*2)):
        ProcessSamples(file_name1, file_name2)

if __name__ == '__main__':
    main()
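The zip(*([iter(file_names)]*2)) expression in main() can be hard to read at first glance. The sketch below shows what it appears to do, written out with a helper whose name, in_pairs, is made up for this example: the same iterator is passed to zip() twice, so each tuple consumes two consecutive items.

```python
def in_pairs(items):
    """Group a flat sequence into consecutive, non-overlapping pairs."""
    it = iter(items)          # one shared iterator
    return list(zip(it, it))  # zip pulls from that iterator twice per tuple


file_names = ('a.txt', 'b.txt', 'c.txt', 'd.txt')
print(in_pairs(file_names))  # [('a.txt', 'b.txt'), ('c.txt', 'd.txt')]
```

With an odd number of items, the last one is silently dropped, so the file list should always contain complete pairs.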