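I need to count the occurrences of a motif (including overlapping ones) in sequences. The motif is passed on the first line of standard input and the sequences follow on the subsequent lines. Sequence names start with >, and everything after the first space is just a comment about the sequence that must be ignored. The program input looks like this: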
AT
>seq1 Comment......
AGGTATA
TGGCGCC
>seq2 Comment.....
GGCCGGCGC
The output should be:
seq1: 2
seq2: 0
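To make "including overlaps" concrete: Python's built-in str.count only counts non-overlapping matches, so it can undercount a motif that overlaps itself. A minimal sketch of the difference (the helper name count_overlapping is only for illustration):

```python
def count_overlapping(motif, seq):
    """Count every start position at which motif occurs in seq."""
    if not motif:
        return 0  # assumption: an empty motif is treated as having no occurrences
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


print("AAAA".count("AA"))                          # 2: non-overlapping matches only
print(count_overlapping("AA", "AAAA"))             # 3: starts at positions 0, 1 and 2
print(count_overlapping("AT", "AGGTATATGGCGCC"))   # 2, the expected value for seq1
```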
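I decided to save the first line as the motif, strip the comments from the sequence names, merge the lines of each sequence into a single line, and store the sequence names (keys) and sequences (values) in a dictionary. I also wrote a function motif_count that I want to call on the dictionary values and then store the results in a separate dictionary for the final output. Can I do it this way, or is there a better approach?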
#!/usr/bin/env python3
import sys

sequence = sys.stdin.readlines()
# Strip the trailing newline, otherwise the motif never matches
motif = sequence[0].strip()
d = {}
temp_genename = None

def motif_count(m, s):
    """Count occurrences of m in s, including overlapping ones."""
    count = 0
    next_pos = -1
    while True:
        next_pos = s.find(m, next_pos + 1)
        if next_pos < 0:
            break
        count += 1
    return count

if len(sequence) < 2 or sequence[1][0] != '>':
    print("ERROR")
    sys.exit(1)

for line in sequence[1:]:
    if line[0] == '>':
        # Keep only the name: drop the leading '>' and the comment
        temp_genename = line[1:].split(' ')[0].strip()
        d[temp_genename] = ""
    else:
        d[temp_genename] += line.strip()

# Iterate over names and sequences, not just the keys
for name, seq in d.items():
    print(f"{name}: {motif_count(motif, seq)}")
Answer 0 (score: 0)
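You can simplify the code by using a dictionary and string expressions to extract the keys you need. Assuming your sequence headers are consistent and look like the ones you provided, you can split off the redundant "This sequence is from" text, then filter for the uppercase letters, and finally count the occurrences of the motif. It can be done as follows: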
def motif_count(motif, key):
    # Replace the stored sequence with its motif count.
    # Note: str.count only counts non-overlapping occurrences.
    d[key] = d[key].count(motif)

sequence = """AT
>seq1 This sequence is from bacterial genome
AGGTATA
TGGCGCC
>seq2 This sequence is rich is CG
GGCCGGCGC""".split('\n')

d = {}

# print error if format is wrong
if sequence[1][0] != '>':
    print("ERROR")
else:
    # join all lines, then split into one chunk per '>' record
    seq = "".join(sequence).split('>')[1:]
    func = lambda line: line.split(' This sequence is ')
    # keep only uppercase letters (this also keeps uppercase
    # letters that appear in the comment, such as "CG")
    d = dict((func(line)[0], ''.join([c for c in func(line)[1] if c.isupper()]))
             for line in seq)
    motif = sequence[0]
    # replace seq with its count
    for key in d:
        motif_count(motif, key)
    # print output
    print(d)
Output:
{'seq1': 2, 'seq2': 0}
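Since the task asks for overlapping occurrences and real comments will not always start with "This sequence is", here is a hedged sketch of a variant that relies only on the first whitespace in the header line and uses a zero-width regex lookahead to count overlaps. The names parse_fasta and count_overlapping are my own, not part of any library, and the sample text is the input from the question:

```python
import re


def parse_fasta(lines):
    """Map each sequence name (text after '>' up to the first whitespace)
    to its sequence lines joined into one string."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split(None, 1)  # name, then the ignored comment
            name = fields[0] if fields else ""
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def count_overlapping(motif, seq):
    """Count matches of motif in seq, including overlapping ones."""
    if not motif:
        return 0  # assumption: an empty motif counts as zero occurrences
    # The lookahead matches at every start position without consuming text.
    return len(re.findall("(?=" + re.escape(motif) + ")", seq))


text = """AT
>seq1 Comment......
AGGTATA
TGGCGCC
>seq2 Comment.....
GGCCGGCGC"""

lines = text.split("\n")
motif = lines[0].strip()
for name, seq in parse_fasta(lines[1:]).items():
    print(f"{name}: {count_overlapping(motif, seq)}")
```

To read the real input instead of the sample, replace text with sys.stdin.read() (after import sys). The counts are printed in the order the sequences appear, because dictionaries keep insertion order in Python 3.7 and later.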