Bioinformatics: finding genes in a genome string

Time: 2016-02-25 01:41:05

Tags: python python-3.x

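Biologists use a sequence of the letters A, C, T, and G to model a genome. A gene is a substring of a genome that starts after a triplet ATG and ends before a triplet TAG, TAA, or TGA. Furthermore, the length of a gene string is a multiple of 3, and the gene does not contain any of the triplets ATG, TAG, TAA, and TGA.

Ideally: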

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT #Enter   
TTT
GGGCGT
-----------------
Enter a genome string: TGTGTGTATAT
No Genes Were Found

So far, I have:

def findGene(gene):
    final = ""
    genep = gene.split("ATG")
    for part in genep:
        for chr in part:
            for i in range(0, len(chr)):
                if genePool(chr[i:i + 3]) == 1:
                    break
                else:
                    final += (chr[i+i + 3] + "\n")
    return final

def genePool(part):
    g1 = "ATG"
    g2 = "TAG"
    g3 = "TAA"
    g4 = "TGA"
    if (part.count(g1) != 0) or (part.count(g2) != 0) or (part.count(g3) != 0) or (part.count(g4) != 0):
        return 1

def main():
    geneinput = input("Enter a genome string: ")
    print(findGene(geneinput))

main()
# TTATGTTTTAAGGATGGGGCGTTAGTT

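I keep running into an error.

To be honest, this really isn't working for me. I think I've hit a dead end with these lines of code, and a new approach might help.

Thanks in advance!

The error I am getting: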

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT
Traceback (most recent call last):
  File "D:\Python\Chapter 8\Bioinformatics.py", line 40, in <module>
    main()
  File "D:\Python\Chapter 8\Bioinformatics.py", line 38, in main
    print(findGene(geneinput))
  File "D:\Python\Chapter 8\Bioinformatics.py", line 25, in findGene
    final += (chr[i+i + 3] + "\n")
IndexError: string index out of range

Like I said before, I'm not sure I'm on the right track with my current code. Any new ideas with pseudocode would be appreciated!

1 Answer:

Answer 0 (score: 3)

This can be done with a regular expression:
import re

pattern = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
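# findall returns only the captured group, i.e. the gene between ATG and the stop triplet
print(pattern.findall('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(pattern.findall('TGTGTGTATAT'))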

Output:

['TTT', 'GGGCGT']
[]
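To get the exact output format the question asks for (one gene per line, or a message when nothing is found), the result of findall can be wrapped in a small helper. This is only a sketch: the name report_genes is made up for illustration, and it assumes the input is already upper case.

```python
import re

# Same pattern as in the answer above.
GENE_PATTERN = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')

def report_genes(genome):
    """Return the text to print for one genome string (hypothetical helper)."""
    genes = GENE_PATTERN.findall(genome)
    if not genes:
        return "No Genes Were Found"
    # One gene per line, as in the desired output of the question.
    return "\n".join(genes)

print(report_genes('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(report_genes('TGTGTGTATAT'))
```

In a real program the string would come from input("Enter a genome string: ") as in the question's main().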

Explanation, taken from https://regex101.com/r/yI4tN9/3:
"ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)"g
    ATG matches the characters ATG literally (case sensitive)
    1st Capturing group ((?:[ACTG]{3})+?)
        (?:[ACTG]{3})+? Non-capturing group
            Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
            [ACTG]{3} match a single character present in the list below
                Quantifier: {3} Exactly 3 times
                ACTG a single character in the list ACTG literally (case sensitive)
    (?:TAG|TAA|TGA) Non-capturing group
        1st Alternative: TAG
            TAG matches the characters TAG literally (case sensitive)
        2nd Alternative: TAA
            TAA matches the characters TAA literally (case sensitive)
        3rd Alternative: TGA
            TGA matches the characters TGA literally (case sensitive)
    g modifier: global. All matches (don't return on first match)
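Since the question also asked for a new approach without being tied to one technique, here is a rough sketch of the same idea written as a plain loop instead of a regular expression. The function name find_genes_by_scanning is hypothetical. It rests on two assumptions: only triplets read in frame (three letters at a time after ATG) are checked against the stop triplets, and an in-frame ATG inside a candidate gene is not rejected. One difference from the pattern above: because the pattern requires at least one triplet before the stop, it returns ['TAGAAA'] for 'ATGTAGAAATAA', whereas this loop treats the TAG right after ATG as a stop and returns nothing.

```python
STOP_TRIPLETS = ("TAG", "TAA", "TGA")

def find_genes_by_scanning(genome):
    """Sketch: collect genes by walking triplets after each ATG (hypothetical helper)."""
    genes = []
    i = genome.find("ATG")
    while i != -1:
        start = i + 3          # the gene begins right after ATG
        stop = None
        j = start
        # Read in-frame triplets until a stop triplet or the end of the string.
        while j + 3 <= len(genome):
            if genome[j:j + 3] in STOP_TRIPLETS:
                stop = j
                break
            j += 3
        if stop is not None and stop > start:
            genes.append(genome[start:stop])
            # Continue searching after the stop triplet.
            i = genome.find("ATG", stop + 3)
        else:
            # Empty gene or no stop triplet in frame: try the next ATG.
            i = genome.find("ATG", i + 1)
    return genes

print(find_genes_by_scanning('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(find_genes_by_scanning('TGTGTGTATAT'))
```

Because the loop only slices in steps of three and never indexes past the end of the string, it avoids the IndexError from the question's chr[i+i + 3].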