Question

How can I find only the first start_codon in each frame? The code below gives me all of the start_codon positions.

from Bio import SeqIO

def test(seq, start):
    # Collect the position of every start codon in the three forward frames
    start_codons = []
    for frame in range(3):
        for i in range(frame, len(seq) - 2, 3):
            current_codon = str(seq[i:i+3])
            if current_codon in start:
                start_codons.append(i)
    return start_codons

start = ["ATG"]
with open("a.fa") as f:
    for record in SeqIO.parse(f, "fasta"):
        seq = record.seq
        name = record.id
        start_codons = test(seq, start)
        print(name, start_codons)

Answer 1

If you have a DNA string and want to find the first occurrence of the "ATG" sequence, the simplest way is:

DNA = "ACCACACACCATATAATGATATATAGGAAATG"

print(DNA.find("ATG"))

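This prints 15. Note that indexing in Python starts at 0.

If you only want matches that fall on codon (nucleotide triplet) boundaries, i.e. reading frame 0: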

DNA = "ACCACACACCATATAATGATATATAGGAAATG"
for i in range(0, len(DNA), 3):
    if DNA[i:i+3] == "ATG":
        print(i)
        break

This also prints 15.
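The loop above only covers frame 0. To get the first start codon in each of the three forward frames, which is what the question asks for, the same idea can be repeated per frame. This is a minimal sketch, not a definitive implementation: the function name `first_start_per_frame` and the returned dict layout are my own choices, and only the forward strand is considered.

```python
def first_start_per_frame(seq, start_codons=("ATG",)):
    """Return {frame: index of the first start codon in that frame, or None}."""
    seq = str(seq).upper()  # also accepts a Biopython Seq, since str() works on it
    result = {}
    for frame in range(3):
        result[frame] = None
        # Step through the sequence codon by codon, starting at this frame's offset
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in start_codons:
                result[frame] = i
                break  # stop at the first hit in this frame
    return result

DNA = "ACCACACACCATATAATGATATATAGGAAATG"
print(first_start_per_frame(DNA))  # {0: 15, 1: None, 2: 29}
```

Frame 1 has no in-frame ATG in this example, so it maps to None. The `break` only leaves the inner loop, so the remaining frames are still scanned.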

Answer 2

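This is easy with a regex and re.match, because re.match only tries to match at the beginning of the string: it returns a match object on success and None otherwise.

For example, if you want to scan a sequence: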

import re

sequence = 'ATGTTGTGAGCGGATGGTTTAAT'
index = 0
while index <= len(sequence) - 6:  # the shortest possible ORF is 6 nt (ATG + stop codon)
    # re.match anchors at the start of the slice, i.e. at position `index`
    match = re.match(r'(ATG(?:\S{3})*?T(?:AG|AA|GA))', sequence[index:])
    if match:
        print('Find the first one', match.group())
        break
    else:
        index += 1

You will then get the output:

Find the first one ATGTTGTGA
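The loop above stops at the first ORF it meets, regardless of reading frame. If you need the first ORF in each frame instead, one option is to let the regex skip whole codons before the ATG and to start matching at offsets 0, 1 and 2. This is only a sketch under my own assumptions: `ORF_RE` and `first_orf_per_frame` are names I made up, the sequence is assumed to contain only A, C, G and T, and only the forward strand is searched.

```python
import re

# Lazily skip whole codons, then capture ATG ... up to the first in-frame stop codon.
ORF_RE = re.compile(r"(?:[ACGT]{3})*?(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA))")

def first_orf_per_frame(seq):
    """Return {frame: (start_index, orf) or None} for the three forward frames."""
    seq = str(seq).upper()
    result = {}
    for frame in range(3):
        # A compiled pattern's match(string, pos) anchors the match at pos,
        # so every skipped codon stays aligned to this frame.
        match = ORF_RE.match(seq, frame)
        result[frame] = (match.start(1), match.group(1)) if match else None
    return result

sequence = 'ATGTTGTGAGCGGATGGTTTAAT'
print(first_orf_per_frame(sequence))
# {0: (0, 'ATGTTGTGA'), 1: (13, 'ATGGTTTAA'), 2: None}
```

The reported start index refers to the original sequence, because `match.start(1)` is measured from the beginning of the string and not from `pos`.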

Find and print only the first ORF in a fasta file
