Question

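I am analyzing sequencing data. At one stage I have a dictionary {bcCountSorted} holding nucleotide sequences (barcodes) with their counts, and a list [bcSortedList] of those barcodes, sorted by frequency of occurrence. They look like this: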

>>> head(bcCountSorted)
('CTAGACGGTGCCGAAGTC', 8)
('CTGGGTCCCCACCATCCC', 8)
('TACAGTGGACTAGGCGAA', 7)
('ATCTACGAGTCGTCCAAC', 7)
('AGTCTGCGGAAATGCCAA', 7)

>>> head(bcSortedList)
CCCGACATTCGAGCACGT
ACGACTTGTGATAAAACG
CAGCCTACAGTTCTCCCC
AAGTTGCCCGGACAGTTT
ACACCCACTTCCAGGATC

Next, for each barcode in the bcCountSorted dictionary I need to find all of its mutant variants in the bcSortedList list, and then remove the variants that were found from bcSortedList.

The following code performs this task:

import regex

for item in bcCountSorted:
    expr = regex.compile("({})".format(item[0]) + "{e<=" + str(barcodeError) + "}")
    tmpListBc = [bcSortedList.pop(bcSortedList.index(l)) for l in bcSortedList for m in [expr.search(l)] if m]
    if len(tmpListBc) > 0:
        tmpElementDict = [(x, bcCountSortedDict[x]) for x in tmpListBc]
        bcDict[tmpListBc[0]] = tmpElementDict

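where:

expr is a regular expression that allows up to 2 errors (barcodeError); a single search with it takes about 5 * 10^-6 seconds

item[0] is a nucleotide sequence, for example 'CTAGACGGTGCCGAAGTC'

tmpListBc is a list comprehension that finds matching sequences and removes them from bcSortedList as soon as they are found. It takes 55 seconds to run.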
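
A note on the list comprehension described above: `bcSortedList.index(l)` and `bcSortedList.pop(...)` each walk or shift the list, and removing items from a list while iterating over it can cause the element following each removed one to be skipped. Below is a minimal sketch of a single-pass alternative. It is not the code from the question: `within_errors` and `partition_variants` are hypothetical names, and `within_errors` is a simplified stand-in that counts substitutions only (Hamming distance), whereas the `regex` pattern `{e<=2}` also allows insertions and deletions.

```python
def within_errors(a, b, max_errors=2):
    """Hypothetical stand-in for the fuzzy regex: counts substitutions only."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_errors:
                return False  # stop early once the error budget is exceeded
    return True


def partition_variants(barcode, candidates, max_errors=2):
    """Split candidates into (matched, remaining) in one pass,
    without mutating the list that is being iterated."""
    matched, remaining = [], []
    for seq in candidates:
        if within_errors(barcode, seq, max_errors):
            matched.append(seq)
        else:
            remaining.append(seq)
    return matched, remaining


# Toy data: the first three differ from the reference by 0, 1 and 2 bases.
barcodes = [
    "CTAGACGGTGCCGAAGTC",
    "CTAGACGGTGCCGAAGTA",
    "CTAGACGGTGCCGAAGAA",
    "TACAGTGGACTAGGCGAA",
]
matched, remaining = partition_variants("CTAGACGGTGCCGAAGTC", barcodes)
print(matched)
print(remaining)
```

This only removes the repeated `index`/`pop` work inside one iteration; each barcode still requires a full pass over the remaining list, so it is not by itself enough to reach the target time.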

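Hence the problem. The bcCountSorted dictionary has 10,000,000 elements and the bcSortedList list has ~7,500,000 items. One loop iteration takes about 55 seconds on a single processor, so to finish the task in an acceptable time a single iteration must take no more than 0.004 seconds.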
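
To make the scale concrete, the arithmetic behind these figures can be worked through as follows. The totals are derived here from the numbers stated in the question (they are not given in it), and the variable names are illustrative.

```python
n_barcodes = 10_000_000              # elements in bcCountSorted
seconds_per_iteration = 55           # measured time for one loop iteration
target_seconds_per_iteration = 0.004

# Total run time = number of iterations * time per iteration.
current_total_s = n_barcodes * seconds_per_iteration
target_total_s = n_barcodes * target_seconds_per_iteration

SECONDS_PER_YEAR = 365 * 24 * 3600
current_years = current_total_s / SECONDS_PER_YEAR
target_hours = target_total_s / 3600

print(f"current: {current_years:.1f} years")  # roughly 17.4 years
print(f"target:  {target_hours:.1f} hours")   # roughly 11.1 hours
```

In other words, the stated target of 0.004 seconds per iteration corresponds to a total run of roughly 11 hours, a speedup of more than four orders of magnitude over the current 55 seconds.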

Python performance. Searching a list for items with errors

0 answers: