Computing the score of a list of motifs

Time: 2015-10-16 23:57:07

Tags: python bioinformatics

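I have implemented a function that does this. For each column of the motif matrix, it finds the most common element and subtracts that element's count from the total number of elements in the column. It then adds up these numbers across all columns.

Is there a way to make it faster? Here is my code: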

def scoreMotifs(motifs):
    '''This function computes the score of a list of motifs'''
    z = []
    # Build one string per column by taking the i-th character of every motif
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    print(z)
    totalscore = 0
    for string in z:
        # Column score = column length minus the count of its most common base
        score = len(string) - max([string.count('A'), string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore

motifs = ['GCG','AAG','AAG','ACG','CAA']
scoreMotifs(motifs)
['GAAAC', 'CAACA', 'GGGGA']
5
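
To make the arithmetic behind that result concrete, here is a small Python 3 sketch that prints the per-column breakdown for the sample input. The helper name `column_scores` is hypothetical and is not part of the code above.

```python
def column_scores(motifs):
    """Return (column, most_common_count, score) for each column of motifs."""
    rows = []
    for i in range(len(motifs[0])):
        # Collect the i-th character of every motif into one column string.
        column = ''.join(m[i] for m in motifs)
        # Count of the most frequent base in this column.
        most_common = max(column.count(base) for base in 'ACGT')
        rows.append((column, most_common, len(column) - most_common))
    return rows

for column, most_common, score in column_scores(['GCG', 'AAG', 'AAG', 'ACG', 'CAA']):
    print(column, most_common, score)
# GAAAC 3 2
# CAACA 3 2
# GGGGA 4 1
```

The three column scores (2, 2 and 1) add up to the total of 5.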

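1 answer:

Answer 0 (score: 2)

OK, I profiled your code with line_profiler (the @profile decorator is supplied by its kernprof runner, so it needs no import):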

from random import randrange

@profile
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = []
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore   

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

motifs = [random_seq() for _ in range(1000000)]
print(scoreMotifs(motifs))

Here are the results:

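Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def scoreMotifs(motifs):
     5                                               '''This function computes the score of list of motifs'''
     6         1            4      4.0      0.0      z = []
     7         4           14      3.5      0.0      for i in range(len(motifs[0])):
     8         3            2      0.7      0.0          y = ''
     9   3000003      1502627      0.5     41.7          for j in range(len(motifs)):
    10   3000000      2075204      0.7     57.5              y += motifs[j][i]
    11         3           22      7.3      0.0          z.append(y)
    12         1            1      1.0      0.0      totalscore = 0
    13         4            4      1.0      0.0      for string in z:
    14         3        29489   9829.7      0.8          score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
    15         3            5      1.7      0.0          totalscore += score
    16         1            1      1.0      0.0      return totalscore
Total time: 3.60737 s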

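Almost all of the time (about 99%, profiler lines 9 and 10) is spent in the inner loop that builds each column string one character at a time: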

y += motifs[j][i]

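The zip trick is a much better way to transpose the strings: zip(*motifs) gives one tuple per column. You can therefore rewrite the code as: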

from random import randrange

@profile
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = zip(*motifs)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore  

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])


motifs = [random_seq() for _ in range(1000000)]
print(scoreMotifs(motifs))

motifs = ['GCG','AAG','AAG','ACG','CAA']
print(scoreMotifs(motifs))

Total time for the rewritten version:

Total time: 0.61699 s

I'd say that is a pretty good improvement: 3.61 s down to 0.62 s, nearly six times faster.
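
For reference, here is a minimal Python 3 sketch of what the zip trick does in isolation, followed by a more compact form of the scoring loop. `zip(*motifs)` passes each motif as a separate argument to `zip`, which then produces one tuple per column. It stops at the shortest string, so the sketch assumes all motifs have the same length. The name `score_motifs_compact` is hypothetical, and this variant was not part of the profiling above, so no timing is claimed for it.

```python
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

# Each tuple holds one column of the motif matrix.
columns = list(zip(*motifs))
print(columns[0])  # ('G', 'A', 'A', 'A', 'C')

def score_motifs_compact(motifs):
    """Sum, over columns, of (column height - count of the most common base)."""
    return sum(
        len(col) - max(col.count(base) for base in 'ACGT')
        for col in zip(*motifs)
    )

print(score_motifs_compact(motifs))  # 5
```

Tuples support `len` and `count` just like strings, which is why the scoring expression works unchanged on the output of `zip`.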