How can I find only the first start codon in each reading frame? The code below gives me the positions of all start codons.
from Bio import SeqIO

def test(seq, start):
    # Collect the index of every start codon, scanning each of the three frames.
    start_codons = []
    for frame in range(3):
        for i in range(frame, len(seq) - 2, 3):
            current_codon = seq[i:i+3]
            if current_codon in start:
                start_codons.append(i)
    return start_codons

start = ["ATG"]
with open("a.fa", "r") as f:
    for record in SeqIO.parse(f, "fasta"):
        seq = str(record.seq)  # plain string, so slices compare cleanly with "ATG"
        name = record.id
        start_codons = test(seq, start)
        print(name, start_codons)
Answer 0 (score: 1)
If you have a DNA string and want to find the first occurrence of the "ATG" sequence, the simplest way is:
DNA = "ACCACACACCATATAATGATATATAGGAAATG"
print(DNA.find("ATG"))
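This prints 15. Note that indexing in Python starts at 0, and that str.find ignores the reading frame: it returns the first "ATG" wherever it occurs.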
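Since str.find is frame-agnostic, one way to restrict it to a single reading frame is to keep searching until the hit position modulo 3 equals the frame. This is only a minimal sketch; the helper name find_in_frame is hypothetical and not part of the original answer.

```python
def find_in_frame(seq, codon, frame):
    """Return the index of the first `codon` whose position lies in `frame` (0, 1 or 2), or -1."""
    pos = seq.find(codon)
    # Skip hits that fall in another frame and resume just after them.
    while pos != -1 and pos % 3 != frame:
        pos = seq.find(codon, pos + 1)
    return pos

DNA = "ACCACACACCATATAATGATATATAGGAAATG"
for frame in range(3):
    print(frame, find_in_frame(DNA, "ATG", frame))
```

A result of -1 means that frame contains no "ATG", mirroring what str.find itself returns on failure.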
If you want to consider nucleotide triplets (codons) instead:
DNA = "ACCACACACCATATAATGATATATAGGAAATG"
for i in range(0, len(DNA), 3):
    if DNA[i:i+3] == "ATG":
        print(i)
        break
This also prints 15.
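The triplet loop above only covers frame 0. To answer the question as asked (the first start codon in each frame), the same loop can be wrapped in an outer loop over the three frames, with the break ending only the inner scan. A minimal sketch, assuming the sequence is a plain string or something str() can convert (such as a Biopython Seq); the function name first_start_per_frame is hypothetical.

```python
def first_start_per_frame(seq, start_codons=("ATG",)):
    """Map each frame (0, 1, 2) to the index of its first start codon, or None."""
    seq = str(seq).upper()
    first = {}
    for frame in range(3):
        first[frame] = None
        # Step through complete codons only, hence the len(seq) - 2 bound.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in start_codons:
                first[frame] = i
                break  # stop at the first hit in this frame only
    return first

DNA = "ACCACACACCATATAATGATATATAGGAAATG"
print(first_start_per_frame(DNA))
```

For this sequence, frame 0 gives 15, frame 2 gives 29, and frame 1 has no "ATG" and stays None.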
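Answer 1 (score: 0)
This is easy with a regex and re.match, because re.match only tries to match at the beginning of the string: it returns a match object on success and None otherwise.
For example, if you want to scan this sequence: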
import re

sequence = 'ATGTTGTGAGCGGATGGTTTAAT'
index = 0
# The shortest possible ORF (ATG plus a stop codon) is 6 nt long.
while index <= len(sequence) - 6:
    # Lazily match whole codons after ATG until the first in-frame stop codon.
    match = re.match(r'(ATG(?:\S{3})*?T(?:AG|AA|GA))', sequence[index:])
    if match:
        print('Find the first one', match.group())
        break
    index += 1
Then you will get the output:
Find the first one ATGTTGTGA
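The loop above stops at the first ORF regardless of frame. To get the first ORF in each reading frame, a compiled pattern can be matched at codon-aligned positions only, using the pos argument of Pattern.match instead of slicing. This is a sketch under assumptions: the function name first_orf_per_frame is hypothetical, and the pattern uses [ACGT] rather than \S, so sequences containing ambiguity codes such as N would need a wider character class.

```python
import re

# ATG, then as few whole codons as possible, then a stop codon (TAG, TAA or TGA).
ORF = re.compile(r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)")

def first_orf_per_frame(seq):
    """Map each frame (0, 1, 2) to (start index, ORF string) of its first ORF, or None."""
    seq = str(seq).upper()
    result = {}
    for frame in range(3):
        result[frame] = None
        # An ORF needs at least 6 nt, so the last usable start is len(seq) - 6.
        for i in range(frame, len(seq) - 5, 3):
            m = ORF.match(seq, i)  # anchored at position i
            if m:
                result[frame] = (i, m.group())
                break
    return result

sequence = 'ATGTTGTGAGCGGATGGTTTAAT'
print(first_orf_per_frame(sequence))
```

Because the codon group is lazy, each match ends at the first in-frame stop codon after its ATG.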