Question

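Simply put, I am looking for the quickest way to search a string for a set of words using a regular expression, without using a for loop. I.e. is there a way to do this: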

text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'
genes = set(['TP53','ATM','BRCA2'])
mutations = 0
if re.search( genes, text):
    mutations += 1
print mutations 
>>>1

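The reason is that I am searching through a complex data structure and don't want to nest too many loops. Here is the problem code in more detail: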

import re

genes = set(['TP53','ATM','BRCA2'])
single_gene = 'ATM'
mutations = 0
data_dict = {
             'sample1': set(['AAA','BBB','TP53']),
             'sample2': set(['AAA','ATM','TP53']),
             'sample3': set(['AAA','CCC','XXX']),
             'sample4': set(['AAA','ZZZ','BRCA2'])
            }

for sample in data_dict:
    for gene in data_dict[sample]:
        if re.search( single_gene, gene):
            mutations += 1
            break

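I can easily search for 'single_gene', but I want to search for 'genes'. If I add another for loop to iterate over 'genes', the code gets even more complicated, because I have to add another 'break' and a boolean to control when the break happens. Functionally it works, but it is very clunky and there must be a more elegant way to do this. See my clunky extra loop over the set below (currently my only solution):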

for sample in data_dict:
    for gene in data_dict[sample]:
        MUT = False
        for mut in genes:
            if re.search( mut, gene):
                mutations += 1
                MUT = True
                break
        if MUT == True:
            break

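Important: for each sample I only want to add 0 or 1 to 'mutations', depending on whether any gene from 'genes' occurs in that sample's set. I.e. 'sample2' will add 1 to mutations and 'sample3' will add 0. Let me know if anything needs further clarification. Thanks in advance!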

Answer 1

If the target strings are fixed text (i.e. not regular expressions), don't use re. A plain substring test is much more efficient:

for gene in genes:
    if gene in text:
        print('True')

There are variations on this theme, such as:

if [gene for gene in genes if gene in text]:
    ...

This is confusing to read, involves a list comprehension, and relies on the fact that Python treats an empty list [] as false.
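
One possible way to keep the short-circuit without building a temporary list is the built-in any() with a generator expression. A minimal sketch, reusing the genes and text values from the question (the variable name found is only illustrative):

```python
genes = {'TP53', 'ATM', 'BRCA2'}
text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'

# any() stops at the first gene that occurs in text, so no list is built
found = any(gene in text for gene in genes)

mutations = 0
if found:
    mutations += 1
print(mutations)  # 1
```

This reads close to the original intent ("is any gene in the text?") and does not depend on the truthiness of an empty list.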

Updated because the question changed:

You are still doing this the hard way. Consider this instead:

def find_any_gene(genes, text):
    """Returns True if any of the subsequences in genes
       is found within text.
    """
    for gene in genes:
        if gene in text:
            return True
    return False

mutations = 0

for sample in data_dict:
    for gene in data_dict[sample]:
        # gene is one entry of the sample; genes is the set being searched for
        if find_any_gene(genes, gene):
            mutations += 1
            break  # count each sample at most once

This has the advantage of reducing the code needed to short-circuit the search, it is more readable, and other code can call the function find_any_gene().
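
If the entries in each sample are exact gene names, rather than longer strings that merely contain one, set operations might remove the inner loops altogether. This is a sketch under that assumption only; data_dict is written here as a valid dict, and count_mutated_samples is a hypothetical helper name that does not appear in the question:

```python
genes = {'TP53', 'ATM', 'BRCA2'}
data_dict = {
    'sample1': {'AAA', 'BBB', 'TP53'},
    'sample2': {'AAA', 'ATM', 'TP53'},
    'sample3': {'AAA', 'CCC', 'XXX'},
    'sample4': {'AAA', 'ZZZ', 'BRCA2'},
}

def count_mutated_samples(genes, data_dict):
    """Count samples that share at least one exact gene name with genes."""
    # isdisjoint() is True when the two sets have no element in common
    return sum(1 for sample_genes in data_dict.values()
               if not genes.isdisjoint(sample_genes))

mutations = count_mutated_samples(genes, data_dict)
print(mutations)  # 3
```

Note that this only finds exact matches. An entry such as '123CDH47aaCDHzz' would not match 'CDH', so the substring test in find_any_gene() is still needed for that kind of data.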

Answer 2

Does this work? I used some of the examples from the comments.

Let me know if I'm close?!

genes = set(['TP53','ATM','BRCA2', 'aaC', 'CDH'])
mutations = 0
data_dict = {
             "sample1":set(['AAA','BBB','TP53']),
             "sample2":set(['AAA','ATM','TP53']),
             "sample3":set(['AAA','CCC','XXX']),
             "sample4":set(['123CDH47aaCDHzz','ZZZ','BRCA2'])
            }

for sample in data_dict:
    for gene in data_dict[sample]:
        if [ mut for mut in genes if mut in gene ]:
            print("Found mutation: "+str(gene), end=" ")
            print("in sample: "+str(data_dict[sample]))
            mutations += 1

print(mutations)
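
Since the question asked for a single regular-expression search, one option is to join the gene names into one alternation pattern and compile it once. The sketch below is only one reading of what is wanted: it counts each sample at most once, as the question requires, and it uses re.escape in case a name ever contains a regex metacharacter. The name pattern is illustrative.

```python
import re

genes = {'TP53', 'ATM', 'BRCA2', 'aaC', 'CDH'}
data_dict = {
    "sample1": {'AAA', 'BBB', 'TP53'},
    "sample2": {'AAA', 'ATM', 'TP53'},
    "sample3": {'AAA', 'CCC', 'XXX'},
    "sample4": {'123CDH47aaCDHzz', 'ZZZ', 'BRCA2'},
}

# One alternation over all names, e.g. 'ATM|BRCA2|CDH|TP53|aaC'
# (sorted only so the pattern is reproducible between runs)
pattern = re.compile('|'.join(re.escape(g) for g in sorted(genes)))

mutations = 0
for sample, entries in data_dict.items():
    # any() stops at the first matching entry, so a sample adds at most 1
    if any(pattern.search(entry) for entry in entries):
        mutations += 1

print(mutations)  # 3
```

For plain fixed strings the substring test from Answer 1 is usually simpler; a compiled alternation mainly pays off when the same set of names is searched many times.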

Python: search for a set of words
