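Need some help finding open reading frames in DNA with a regular expression

Time: 2013-09-12 18:04:45

Tags: python regex eclipse

I am trying to find every possible reading frame that is at least a given minimum number of nucleotides long.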

"A[TU]G(?:(...){3}){%d,}?(?:[TU]AG|[TU]AA|[TU]GA)" % (minimal_aa) 

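This almost does what I need, but for some reason some of the reading frames fail to recognize certain stop codons.

I am sure it has something to do with the (...) part. How do I tell it to always stop at [TU]AG|[TU]AA|[TU]GA, while still letting it pass through additional start codons?

I am using Python in Eclipse.

I am checking my pattern on Pythex.org, but here is a sample of what I am talking about: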

AUGGAGAGCCUUGUUCUUGGUGUCAACGAGAAAACACACGUCCAACUCAGUUUGCCUGUCCUUCAGGUUAGAGACGUGCUAGUGCGUGGCUUCGGGGACUCUGUGGAAGAGGCCCUAUCGGAGGCACGUGAACACCUCAAAAAUGGCACUUGUGGUCUAGUAGAGCUGGAAAAAGGCGUACUGCCCCAGCUUGAACAGCCCUAUGUGUUCAUUAAACGUUCUGAUGCCUUAAGCACCAAUCACGGCCACAAGGUCGUUGAGCUGGUUGCAGAAAUGGACGGCAUUCAGUACGGUCGUAGCGGUAUAACACUGGGAGUACUCGUGCCACAUGUGGGCGAAACCCCAAUUGCAUACCGCAAUGUUCUUCUUCGUAAGAACGGUAAUAAGGGAGCCGGUGGUCAUAGCUAUGGCAUCGAUCUAAAGUCUUAUGACUUAGGUGACGAGCUUGGCACUGAUCCCAUUGAAGAUUAUGAACAAAACUGGAACACUAAGCAUGGCAGUGGUGCACUCCGUGAACUCACUCGUGAGCUCAAUGGAGGUGCAGUCACUCGCUAUGUCGACAACAAUUUCUGUGGCCCAGAUGGGUACCCUCUUGAUUGCAUCAAAGAUUUUCUCGCACGCGCGGGCAAGUCAAUGUGCACUCUUUCCGAACAACUUGAUUACAUCGAGUCGAAGAGAGGUGUCUACUGCUGCCGUGACCAUGAGCAUGAAAUUGCCUGGUUCACUGAGCGCUCUGAUAAGAGCUACGAGCACCAGACACCCUUCGAAAUUAAGAGUGCCAAGAAAUUUGACACUUUCAAAGGGGAAUGCCCAAAGUUUGUGUUUCCUCUUAACUCAAAAGUCAAAGUCAUUCAACCACGUGUUGAAAAGAAAAAGACUGAGGGUUUCAUGGGGCGUAUACGCUCUGUGUACCCUGUUGCAUCUCCACAGGAGUGUAACAAUAUGCACUUGUCUACCUUGAUGAAAUGUAAUCAUUGCGAUGAAGUUUCAUGGCAGA CGUGCGACUUUCUGAAAGCCACUUGUGAACAUUGUGGCACUGAAAAUUUAGUUAUUGAAGGACCUACUACAUGUGGGUACCUACCUACUAAUGCUGUAGUGAAAAUGCCAUGUCCUGCCUGUCAAGACCCAGAGAUUGGACCUGAGCAUAGUGUUGCAGAUUAUCACAACCACUCAAACAUUGAAACUCGACUCCGCAAGGGAGGUAGGACUAGAUGUUUUGGAGGCUGUGUGUUUGCCUAUGUUGGCUGCUAUAAUAAGCGUGCCUACUGGGUUCCUCGUGCUAGUGCUGAUAUUGGCUCAGGCCAUACUGGCAUUAA

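Wait. That is a bad example, because now that I have checked the output by eye, it actually works. I had to shorten it, but I have a sequence a few thousand nucleotides long, full of stop codons, where nothing works properly. I hope you see what I mean; if not, don't worry about it.

Thanks in advance, amigos!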

1 answer:

Answer 0: (score: 1)

Try this pattern to find all the short, possibly overlapping sequences:

(?=A[TU]G((?:.{3})+?)[TU](?:AG|AA|GA))

Capture group 1 holds each sequence without its start and stop codons.
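
For what it's worth, here is a minimal Python sketch of how the lookahead pattern could be used with the `re` module. Two things are worth noting. First, in the question's pattern `(?:(...){3})` consumes nine nucleotides per repetition rather than three, so the match advances three codons at a time, which may be why some stop codons were skipped. Second, the lazy `.{3}` in the pattern above can still swallow a stop codon when the very first codon after the start is a stop. The `find_orfs` helper and its `min_codons` parameter are hypothetical names made up for illustration, and the negative-lookahead variant inside it is an assumption about the behaviour wanted (never read through an in-frame stop), not part of the original answer. It also assumes the sequence is uppercase and contains no whitespace.

```python
import re

# Pattern from the answer. The whole expression sits inside a lookahead, so it
# consumes no characters and matches that overlap earlier ones are still found.
ORF_LOOKAHEAD = re.compile(r"(?=A[TU]G((?:.{3})+?)[TU](?:AG|AA|GA))")


def find_orfs(seq, min_codons=1):
    """Return (start_index, body) for each start codon that reaches an in-frame stop.

    Hypothetical helper: `body` excludes the start and stop codons, and
    `min_codons` is the minimum number of codons required between them.
    """
    # Assumption: every codon is guarded by a negative lookahead, so a stop
    # codon can never be consumed as part of the body.
    pattern = re.compile(
        r"(?=A[TU]G((?:(?![TU](?:AG|AA|GA))[ACGTU]{3}){%d,})[TU](?:AG|AA|GA))"
        % min_codons
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    # Two nested frames: one starting at index 0 and one at index 3.
    print(ORF_LOOKAHEAD.findall("AUGAUGCCCUAA"))  # ['AUGCCC', 'CCC']
    print(find_orfs("AUGAUGCCCUAA"))              # [(0, 'AUGCCC'), (3, 'CCC')]

    # The answer's pattern reads through the UAA that directly follows AUG;
    # the guarded variant rejects this frame instead.
    print(ORF_LOOKAHEAD.findall("AUGUAAGGGUAG"))  # ['UAAGGG']
    print(find_orfs("AUGUAAGGGUAG"))              # []
```

Raising `min_codons` drops the shorter frames, so `find_orfs("AUGAUGCCCUAA", min_codons=2)` keeps only the frame that starts at index 0.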