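I have implemented a function that does this. For each column in the image, it finds the most common element and subtracts that element's count from the total number of elements in the column. It then sums these per-column numbers. This image shows what the function does.
Is there a way to make it faster? Here is my code: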

    def scoreMotifs(motifs):
        '''This function computes the score of a list of motifs'''
        z = []
        # Build one string per column by reading position i of every motif.
        for i in range(len(motifs[0])):
            y = ''
            for j in range(len(motifs)):
                y += motifs[j][i]
            z.append(y)
        print(z)
        totalscore = 0
        for string in z:
            # Column score = column length minus the count of its most common base.
            score = len(string) - max([string.count('A'), string.count('C'), string.count('G'), string.count('T')])
            totalscore += score
        return totalscore

    motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']
    print(scoreMotifs(motifs))

Output:

    ['GAAAC', 'CAACA', 'GGGGA']
    5

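The scoring rule described above can be traced column by column on the sample input. This is only a minimal sketch in Python 3 syntax; the names `columns` and `column_scores` are illustrative and do not appear in the original code.

```python
# Per-column breakdown of the score for the sample input.
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

# Build each column by reading the i-th character of every motif.
columns = [''.join(m[i] for m in motifs) for i in range(len(motifs[0]))]

column_scores = []
for col in columns:
    # Count of the most common base in this column.
    most_common = max(col.count(base) for base in 'ACGT')
    column_scores.append(len(col) - most_common)
    print(col, ':', len(col), '-', most_common, '=', len(col) - most_common)

total = sum(column_scores)
print('total score:', total)
```

For this input the columns score 2, 2 and 1, which adds up to the 5 returned by scoreMotifs.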
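Answer 0 (score: 2)
OK, I used line_profiler to profile your code. The @profile decorator is supplied by kernprof when the script is run with its -l option, so it needs no import: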

    from random import randrange

    @profile
    def scoreMotifs(motifs):
        '''This function computes the score of a list of motifs'''
        z = []
        for i in range(len(motifs[0])):
            y = ''
            for j in range(len(motifs)):
                y += motifs[j][i]
            z.append(y)
        totalscore = 0
        for string in z:
            score = len(string) - max([string.count('A'), string.count('C'), string.count('G'), string.count('T')])
            totalscore += score
        return totalscore

    def random_seq():
        # Random DNA string of length 3.
        dna_mapping = ['T', 'A', 'C', 'G']
        return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

    # One million random motifs, so the profile is dominated by the loops.
    motifs = [random_seq() for _ in range(1000000)]
    print(scoreMotifs(motifs))

The results are as follows:
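
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         3                                           @profile
         4                                           def scoreMotifs(motifs):
         5                                               '''This function computes the score of a list of motifs'''
         6         1            4      4.0      0.0      z = []
         7         4           14      3.5      0.0      for i in range(len(motifs[0])):
         8         3            2      0.7      0.0          y = ''
         9   3000003      1502627      0.5     41.7          for j in range(len(motifs)):
        10   3000000      2075204      0.7     57.5              y += motifs[j][i]
        11         3           22      7.3      0.0          z.append(y)
        12         1            1      1.0      0.0      totalscore = 0
        13         4            4      1.0      0.0      for string in z:
        14         3        29489   9829.7      0.8          score = len(string) - max([...])
        15         3            5      1.7      0.0          totalscore += score
        16         1            1      1.0      0.0      return totalscore

    Total Time: 3.60737 s
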
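Nearly all of the time is spent in the inner loop (lines 9 and 10 of the profile), and more than half of it on this statement:

    y += motifs[j][i]

The zip trick is a better way to transpose the strings, so you can rewrite the code as: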

    from random import randrange

    @profile
    def scoreMotifs(motifs):
        '''This function computes the score of a list of motifs'''
        # zip(*motifs) transposes the motifs: one tuple of characters per column.
        z = zip(*motifs)
        totalscore = 0
        for string in z:
            score = len(string) - max([string.count('A'), string.count('C'), string.count('G'), string.count('T')])
            totalscore += score
        return totalscore

    def random_seq():
        # Random DNA string of length 3.
        dna_mapping = ['T', 'A', 'C', 'G']
        return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

    motifs = [random_seq() for _ in range(1000000)]
    print(scoreMotifs(motifs))

    # Sanity check on the small example from the question (prints 5).
    motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']
    print(scoreMotifs(motifs))

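To see what the zip trick produces, here is a small illustration on the example from the question. It is written in Python 3 syntax, where zip is lazy and must be wrapped in list() before printing; the variable name `columns` is only illustrative.

```python
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

# zip(*motifs) is the same as zip('GCG', 'AAG', 'AAG', 'ACG', 'CAA'):
# it yields one tuple per character position, that is, one tuple per column.
columns = list(zip(*motifs))
print(columns)

# Tuples support .count() just like strings, so the scoring loop is unchanged.
print(columns[0].count('A'))
```

Each column arrives as a tuple such as ('G', 'A', 'A', 'A', 'C') rather than the string 'GAAAC', which is why len() and count() keep working without building any strings.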
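Total time for the rewritten version:

    Total time: 0.61699 s

I'd say that's a pretty good improvement: down from 3.60737 s, roughly 5.8 times faster.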
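If line_profiler is not available, the comparison can be roughly reproduced with the standard timeit module. The sketch below is an assumption-laden illustration, not the code used above: the function names `score_loop` and `score_zip` are hypothetical, the input is smaller than the one million sequences profiled here, and absolute timings will differ by machine and Python version.

```python
import timeit
from random import choice

def score_loop(motifs):
    """Original approach: build each column with nested loops."""
    columns = []
    for i in range(len(motifs[0])):
        col = ''
        for j in range(len(motifs)):
            col += motifs[j][i]
        columns.append(col)
    return sum(len(c) - max(c.count(b) for b in 'ACGT') for c in columns)

def score_zip(motifs):
    """Rewritten approach: transpose the motifs with zip(*motifs)."""
    return sum(len(c) - max(c.count(b) for b in 'ACGT') for c in zip(*motifs))

# 100,000 random sequences of length 3 (smaller than the profiled input).
motifs = [''.join(choice('ACGT') for _ in range(3)) for _ in range(100000)]

# Both versions must agree before their timings are worth comparing.
assert score_loop(motifs) == score_zip(motifs)

t_loop = timeit.timeit(lambda: score_loop(motifs), number=3)
t_zip = timeit.timeit(lambda: score_zip(motifs), number=3)
print('loop: %.3f s, zip: %.3f s' % (t_loop, t_zip))
```

Timings measured this way exclude the per-line overhead that line_profiler adds, so the ratio between the two versions may not match the figures above.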