Question

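I have a program whose purpose is to read eight files. Each file is about a million characters long, with no punctuation, just a run of characters.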

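The eight files represent four DNA samples that were found. The program takes the characters from one file of a sample and combines them with the characters from the other file of the same sample. For example, if file1 reads: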

abcdefg

and file2 reads:

hijklmn

the combinations would be:

ah, bi, cj, dk, el, fm, gn

Anyway, the program then counts how many times each pair occurs and prints a dictionary, for example:

{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}

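The problem is that although the program works on small files, on files millions of characters long (yes, that is a literal number, not an exaggeration) it hangs and never seems to finish the task. (I left it running overnight once and got no result.)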

Is this an overflow error, or is the method I'm using simply not suited to files this large? Is there a better way to solve this problem?

My code:

import re
from collections import Counter

def ListStore(fileName):
    '''Purpose, stores the contents of file into a single string'''           

    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')

    #reads the file into a string, strips trailing whitespace (including the final newline)
    fileContents = fin.read().rstrip()


    #closes up the file
    fin.close()

    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)   

    #returns the list of characters
    return fileContentsList


def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''


    #creates an empty list for list3
    list3 = []

    #combines the codes from one half with their matching from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]

    return list3


def printResult(list):
    '''stores the result of the combination in a dictionary'''




    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)

    print (result)
    return result


def main():

    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''


    #first sample files

    file_name = "a.txt"
    file_name2 = "b.txt"

    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"

    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"

    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"


    #Get the first sample ready

    #store both sides into a list of characters

    contentList = ListStore(file_name)

    contentList2 = ListStore(file_name2)

    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)

    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)

    print (SampleA)

    # ****Get the second sample ready****

    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)

    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)

    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)

    print (SampleB)

    # ****Get the third sample ready****

    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)

    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)

    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)

    print (SampleC)

    # ****Get the fourth sample ready****

    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)

    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)

    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)

    print (SampleD)



if __name__ == '__main__':
    main()

Answer 1

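Don't read the whole thing into memory. There is no need. Also, zip() already splits your strings into characters, so you don't need to do that yourself.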
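As a small illustration of that point, here is a sketch that uses the two example strings from the question; the variable names are only illustrative. zip() iterates over the strings character by character, so no list() conversion is needed first.

```python
# zip() walks both strings one character at a time,
# so there is no need to turn them into lists beforehand.
file1_text = 'abcdefg'
file2_text = 'hijklmn'

pairs = [a + b for a, b in zip(file1_text, file2_text)]
print(pairs)  # ['ah', 'bi', 'cj', 'dk', 'el', 'fm', 'gn']
```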

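The trick here is to create a generator that pairs up the characters while reading the two files in chunks, which is the most efficient way to read a file.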

Finally, use collections.Counter() to keep the counts:

from functools import partial
from collections import Counter

with open(filename1, 'r') as file1, open(filename2, 'r') as file2:
    chunked1 = iter(partial(file1.read, 1024), '')
    chunked2 = iter(partial(file2.read, 1024), '')
    counts = Counter(''.join(pair) for chunks in zip(chunked1, chunked2) for pair in zip(*chunks))

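The code here reads in chunks of 1024 characters (1024 bytes for plain ASCII data); adjust as needed for best performance. No more than 2048 characters from the files are held in memory at any one time, and the pairs are generated on the fly as they are counted.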
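To try the chunked approach end to end, the snippet can be wrapped in a function and run against two small temporary files. This is only a sketch: the function name count_pairs_chunked, the chunk_size parameter, and the sample contents are assumptions made for the demonstration, and it assumes both files contain plain characters with no newlines.

```python
import os
import tempfile
from collections import Counter
from functools import partial

def count_pairs_chunked(path1, path2, chunk_size=1024):
    """Count character pairs from two files, reading chunk_size characters at a time."""
    with open(path1, 'r') as file1, open(path2, 'r') as file2:
        # iter(callable, sentinel) keeps calling read() until it returns ''
        chunked1 = iter(partial(file1.read, chunk_size), '')
        chunked2 = iter(partial(file2.read, chunk_size), '')
        return Counter(
            a + b
            for chunk1, chunk2 in zip(chunked1, chunked2)
            for a, b in zip(chunk1, chunk2)
        )

# Demonstration with two tiny files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, 'a.txt')
    path2 = os.path.join(tmp, 'b.txt')
    with open(path1, 'w') as f:
        f.write('abcabc')
    with open(path2, 'w') as f:
        f.write('xyzxyz')
    # A small chunk size forces more than one read per file.
    counts = count_pairs_chunked(path1, path2, chunk_size=4)

print(counts)
```

If the files end with a newline, that newline would be paired like any other character, so it may need to be filtered out before counting.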

Answer 2

In your printResult method, you go through each element i in list and assign the value list.count(i) to the key i in the result dictionary.

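I'm not entirely sure how count(i) works, but I believe it involves searching through most of the list and counting the number of elements equal to i each time it runs. In your code, if you have duplicates, as in ['aa','bb','aa'], you will count how many 'aa' elements are in the list twice, going through the whole list each time. On a long list this is very time consuming.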
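To make the cost concrete, here is a rough sketch that mimics the counting in printResult and tallies how many list items get examined. The function name and the scans counter are hypothetical, added only for illustration; list.count() does examine every item of the list on each call.

```python
def count_with_list_count(pairs):
    """Mimic the question's approach: one full scan of the list per element."""
    scans = 0
    result = {}
    for pair in pairs:
        result[pair] = pairs.count(pair)  # count() examines every item in pairs
        scans += len(pairs)
    return result, scans

pairs = ['aa', 'bb', 'aa']
result, scans = count_with_list_count(pairs)
print(result)  # {'aa': 2, 'bb': 1}
print(scans)   # 9 items examined for a 3-item list
```

The work grows with the square of the list length, so a list of a million pairs means on the order of 10^12 item comparisons, which would explain a run that does not finish overnight.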

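You only need to go through the list once to count how many elements of each type there are. I suggest using a defaultdict for this, because you can start each new key with the default value 0.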

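from collections import defaultdict

result = defaultdict(int)
for i in list:
    result[i] = result[i] + 1
print(result)

Creating the defaultdict with int lets each new key start with the value 0. You can then iterate through the list once, adding 1 to the value for each pair every time it is found. This removes the need to go through the list more than once.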
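For comparison, collections.Counter (which the question already imports) does the same single pass. The sketch below checks that the defaultdict loop and Counter agree on a small sample list; the function name and the sample data are made up for the example.

```python
from collections import Counter, defaultdict

def count_single_pass(pairs):
    """Count each pair in one pass, as in the defaultdict loop above."""
    counts = defaultdict(int)  # missing keys start at 0
    for pair in pairs:
        counts[pair] += 1
    return counts

pairs = ['aa', 'bb', 'aa', 'cc', 'bb', 'aa']
by_hand = count_single_pass(pairs)
by_counter = Counter(pairs)

print(dict(by_hand))                      # {'aa': 3, 'bb': 2, 'cc': 1}
print(dict(by_hand) == dict(by_counter))  # True
```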

Answer 3

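As written, I personally don't think your program is I/O bound, and even if it were, breaking the reads up into multiple calls, even buffered ones, wouldn't be as fast as reading the whole thing into memory as you are doing. That said, I'm not sure why your program takes so long on large files. It may be the many unneeded operations it performs: strings and lists are both sequences, so there is usually no need to convert from one to the other.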

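Here is an optimized version of your program with most of the redundant and/or unnecessary parts removed. It actually makes use of the collections.Counter class that your code imported but never used, and although it still reads the file contents into memory, it only keeps them for the minimum time needed to process each pair of files.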

from collections import Counter
import os

DATA_FOLDER = 'datafiles' # folder path to data files ('' for current dir)

def ListStore(fileName):
    '''return contents of file as a single string with any newlines removed'''
    with open(os.path.join(DATA_FOLDER, fileName), 'r') as fin:
        return fin.read().replace('\n', '')

def ListCombo(seq1, seq2):
    '''combine the two DNA sequences into one'''
    # combines the codes from one half with their matching from the other
    return [''.join(pair) for pair in zip(seq1, seq2)]

def CountPairs(seq):
    '''counts occurrences of pairs in the list of the combinations and stores
    them in a Counter dict instance keyed by letter-pairs'''
    return Counter(seq)

def PrintPairs(counter):
    #print the results in the counter dictionary (in sorted order)
    print('{' + ', '.join(('{}: {}'.format(pair, count)
        for pair, count in sorted(counter.items()))) + '}')

def ProcessSamples(file_name1, file_name2):
    # read both sides into strings
    contentList1 = ListStore(file_name1)
    contentList2 = ListStore(file_name2)

    # combine the two lists together
    combo_list = ListCombo(contentList1, contentList2)

    # count the sample results and store into a dictionary
    counter = CountPairs(combo_list)

    #print the results
    PrintPairs(counter)

def main():
    '''reads the contents of N pairs of files, and finds out how many
    combinations exist in each'''
    file_names = ('a.txt', 'b.txt',
                  'c.txt', 'd.txt',
                  'e.txt', 'f.txt',
                  'g.txt', 'h.txt',)

    for (file_name1, file_name2) in zip(*([iter(file_names)]*2)):
        ProcessSamples(file_name1, file_name2)

if __name__ == '__main__':
    main()
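The expression zip(*([iter(file_names)]*2)) in main() may look cryptic. It passes the same iterator to zip() twice, so each tuple takes two consecutive items. A minimal sketch of the idiom, using a hypothetical helper named in_pairs:

```python
def in_pairs(items):
    """Group items two at a time: (items[0], items[1]), (items[2], items[3]), ..."""
    it = iter(items)
    # Both arguments are the same iterator, so zip() pulls
    # two consecutive items for every tuple it builds.
    return zip(it, it)

file_names = ('a.txt', 'b.txt',
              'c.txt', 'd.txt',
              'e.txt', 'f.txt',
              'g.txt', 'h.txt',)

grouped = list(in_pairs(file_names))
print(grouped)
# [('a.txt', 'b.txt'), ('c.txt', 'd.txt'), ('e.txt', 'f.txt'), ('g.txt', 'h.txt')]

# Same result as the expression used in main()
same = list(zip(*([iter(file_names)] * 2)))
print(grouped == same)  # True
```

With an odd number of names, the last one would be dropped silently, because zip() stops at the shortest input.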

Python 3: Is the method I use to count the combination results too slow?
