Question

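I am working on a project that parses PSL files. Overall, the program looks at the reads and identifies circular molecules. I have the program working, but because my loops are nested it is very inefficient: reading an entire PSL file takes more than 10 minutes instead of the roughly 15 seconds it should take.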

The relevant code is:

def readPSLpairs(self):

    posread = []
    negread = []
    result = {}
    for psl in self.readPSL():
        parsed = psl.split()
        strand = parsed[9][-1]
        if strand == '1':
            posread.append(parsed)
        elif strand == '2':
            negread.append(parsed)

    for read in posread:
        posname = read[9][:-2]
        poscontig = read[13]
        for read in negread:
            negname = read[9][:-2]
            negcontig = read[13]
            if posname == negname and poscontig == negcontig:
                try:
                    result[poscontig] += 1
                    break
                except:
                    result[poscontig] = 1
                    break
    print(result)

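I tried restructuring the whole operation so that it appends the values to a list and then matches posname == negname and poscontig == negcontig, but that turned out to be much harder than I expected, so I am stuck trying to improve this function.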

Answer 1

import collections

all_dict = {"pos": collections.defaultdict(int),
            "neg": collections.defaultdict(int)}

result = {}

for psl in self.readPSL():
    parsed = psl.split()
    strand = "pos" if parsed[9][-1]=='1' else "neg"
    name, contig = parsed[9][:-2], parsed[13]
    all_dict[strand][(name,contig)] += 1
# pre-process all the psl's into all_dict['pos'] or all_dict['neg']
#   this is basically just a `collections.Counter` of what you're doing already!

for info, posqty in all_dict['pos'].items():
    negqty = all_dict['neg'][info]  # (defaults to zero)
    result[info] = posqty * negqty
# process all the 'pos' psl's. For every match with a 'neg', set
#   result[(name, contig)] to the total (posqty * negqty)

Note that this discards the full parsed psl values and keeps only the name and contig slices.
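
If the output needs to stay in the shape the question's code produces (one count per contig, incremented once for each strand-1 read that has at least one strand-2 read with the same name on the same contig), a set lookup can replace the inner loop. The following is only a sketch under assumptions: the function name `count_paired_contigs` and its `psl_lines` parameter are hypothetical stand-ins for the method and `self.readPSL()`, and it assumes, as the slicing in the question implies, that the strand marker is the last character of column 9 and that the last two characters are a suffix to strip.

```python
from collections import Counter


def count_paired_contigs(psl_lines):
    """Count, per contig, the strand-1 reads that have a strand-2 mate on that contig.

    psl_lines is any iterable of PSL text lines (hypothetical stand-in
    for self.readPSL() in the question).
    """
    pos_reads = []     # (name, contig) for each strand-1 read, duplicates kept
    neg_pairs = set()  # (name, contig) for each strand-2 read, for O(1) lookup

    for line in psl_lines:
        fields = line.split()
        if len(fields) < 14:
            continue  # skip headers or malformed lines (assumption)
        query, contig = fields[9], fields[13]
        key = (query[:-2], contig)
        if query[-1] == '1':
            pos_reads.append(key)
        elif query[-1] == '2':
            neg_pairs.add(key)

    # One pass over the strand-1 reads; each membership test is constant time,
    # so the total work is linear in the number of reads instead of quadratic.
    return Counter(contig for name, contig in pos_reads
                   if (name, contig) in neg_pairs)
```

Because `Counter` is a `dict` subclass, the result can be printed or indexed just like the `result` dictionary in the question.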
