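I'm working on a project that parses PSL files. Overall, the program looks at read pairs and identifies circular molecules. I have the program working, but because my loops are nested, it is very inefficient: reading an entire PSL file takes more than 10 minutes instead of the roughly 15 seconds it should.
The relevant code is: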
def readPSLpairs(self):
    posread = []
    negread = []
    result = {}
    for psl in self.readPSL():
        parsed = psl.split()
        strand = parsed[9][-1]
        if strand == '1':
            posread.append(parsed)
        elif strand == '2':
            negread.append(parsed)
    for read in posread:
        posname = read[9][:-2]
        poscontig = read[13]
        for read in negread:
            negname = read[9][:-2]
            negcontig = read[13]
            if posname == negname and poscontig == negcontig:
                try:
                    result[poscontig] += 1
                    break
                except:
                    result[poscontig] = 1
                    break
    print(result)
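I tried changing the overall approach: appending the values to lists and then trying to match posname == negname and poscontig == negcontig. That turned out to be much harder than I expected, so I'm stuck trying to improve this function.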
Answer 0: (score: 1)
import collections

all_dict = {"pos": collections.defaultdict(int),
            "neg": collections.defaultdict(int)}
result = {}

# Pre-process all the psl's into all_dict['pos'] or all_dict['neg'].
# This is basically just a `collections.Counter` of what you're doing already!
for psl in self.readPSL():
    parsed = psl.split()
    strand = "pos" if parsed[9][-1] == '1' else "neg"
    name, contig = parsed[9][:-2], parsed[13]
    all_dict[strand][(name, contig)] += 1

# Process all the 'pos' psl's. For every match with a 'neg', set
# result[(name, contig)] to the total (posqty * negqty).
for info, posqty in all_dict['pos'].items():
    negqty = all_dict['neg'][info]  # (defaults to zero)
    result[info] = posqty * negqty
Note that this discards the full parsed psl values and keeps only the name and contig slices.
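The snippet above keys its result by (name, contig) and stores a product, while the code in the question counts, per contig, each '1' read that has at least one '2' mate with the same name on the same contig. If that per-contig count is what is wanted, a minimal sketch could look like the following. The function name `count_pairs_per_contig` is hypothetical, and skipping lines with fewer than 14 fields is an assumption added here (the question's code would raise on them).

```python
from collections import Counter


def count_pairs_per_contig(psl_lines):
    """Count, per contig, the '1' reads that have a '2' mate on the same contig."""
    pos_reads = []    # (name, contig) for every strand-'1' line, duplicates kept
    neg_keys = set()  # distinct (name, contig) pairs seen on strand-'2' lines

    for line in psl_lines:
        parsed = line.split()
        if len(parsed) < 14:
            # Assumption: silently skip header or malformed lines.
            continue
        name, strand, contig = parsed[9][:-2], parsed[9][-1], parsed[13]
        if strand == '1':
            pos_reads.append((name, contig))
        elif strand == '2':
            neg_keys.add((name, contig))

    # One set lookup per '1' read replaces the inner loop over all '2' reads.
    return Counter(contig for name, contig in pos_reads
                   if (name, contig) in neg_keys)
```

Inside `readPSLpairs` this could be called as `print(dict(count_pairs_per_contig(self.readPSL())))`. The set membership test takes roughly constant time, so the work grows with the number of lines rather than with the number of '1' reads times the number of '2' reads.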