Generating a DNA sequence that does not contain a specific sequence

Time: 2018-11-12 18:11:35

Tags: python dna-sequence

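I have just started learning Python programming. In class, we were asked to generate a random DNA sequence that does not contain a specific 6-letter sequence (AACGTT). The key is to write a function that always returns a legal sequence. At the moment, my function generates a correct sequence about 78% of the time. How can I make it return a legal sequence 100% of the time? Any help is appreciated.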

This is what my code looks like right now:

from random import choice
def generate_seq(length, enzyme):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < 1000:
        nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        i = i + 1

    dna = ''.join(str(nucleotide) for nucleotide in list_dna)
    return(dna) 


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)
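
The 78% figure is close to what pure chance predicts, which suggests the function above simply never looks at `enzyme`. The sketch below is a rough check under stated assumptions: it treats the 995 possible start positions as independent, which is only an approximation, and the function names `approx_success_rate` and `simulated_success_rate` are hypothetical helpers, not part of the original code.

```python
import random


def approx_success_rate(length=1000, motif_len=6, alphabet_size=4):
    """Approximate chance that a uniformly random sequence avoids one fixed motif."""
    windows = length - motif_len + 1       # positions where the motif could start
    p_hit = alphabet_size ** -motif_len    # chance that one given window equals the motif
    return (1 - p_hit) ** windows          # assumes windows are independent (approximation)


def simulated_success_rate(trials=1000, length=1000, motif="AACGTT", seed=0):
    """Estimate the same chance by generating random sequences and counting clean ones."""
    rng = random.Random(seed)              # seeded so the estimate is reproducible
    clean = 0
    for _ in range(trials):
        seq = ''.join(rng.choice("ACGT") for _ in range(length))
        if motif not in seq:
            clean += 1
    return clean / trials


print(round(approx_success_rate(), 3))
print(simulated_success_rate())
```

Both numbers should land near 0.78, so roughly one unchecked sequence in five contains AACGTT by accident.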

2 answers:

Answer 0 (score: 1)

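One option is to check the last few entries inside the loop and only keep an appended letter if it does not complete the "bad" sequence. Note, however, that this may make "AACGT" followed by a letter other than the final "T" slightly more likely than it would be in a truly random sequence.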

from random import choice
def generate_seq(length, enzyme):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < length:
        nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        # check for the forbidden sequence; if found, remove the last element and redraw
        if ''.join(list_dna[-len(enzyme):]) == enzyme:
            list_dna.pop()
        else:
            i = i + 1

    dna = ''.join(list_dna)
    return dna


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)
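
If the slight bias mentioned above matters, one alternative is to throw away the whole sequence and start over whenever the forbidden motif shows up. This is a minimal sketch of that idea, not the approach from this answer: `generate_seq_retry` and its `max_tries` parameter are hypothetical names. Every accepted sequence is then equally likely among the legal ones, at the cost of discarding roughly one attempt in five for length 1000.

```python
from random import choice


def generate_seq_retry(length, enzyme, max_tries=10_000):
    """Regenerate the full sequence until it does not contain `enzyme`."""
    nucleotides = "ACGT"
    for _ in range(max_tries):
        dna = ''.join(choice(nucleotides) for _ in range(length))
        if enzyme not in dna:
            return dna
    # Only reached if every attempt contained the motif (very unlikely here).
    raise RuntimeError("no legal sequence found within max_tries")


seq = generate_seq_retry(1000, "AACGTT")
print(len(seq), "AACGTT" in seq)  # prints: 1000 False
```

For much longer sequences or shorter motifs the retry count grows quickly, so the per-letter check is the more practical choice there.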

Answer 1 (score: 1)

One idea is to check whether the previous 5 nucleotides equal AACGT, and in that case choose only from ["A", "C", "G"].

from random import choice


def generate_seq(length, enzyme, bad_prefix="AACGT"):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < length:
        # list_dna is a list, so join its last 5 items before comparing with the string
        if ''.join(list_dna[-5:]) != bad_prefix:
            nucleotide = choice(nucleotides)
        else:
            # the previous 5 letters are AACGT, so a "T" here would complete AACGTT
            nucleotide = choice(["A", "C", "G"])
        list_dna.append(nucleotide)
        i = i + 1

    dna = ''.join(list_dna)
    return dna


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)
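
The same idea can be written so that the forbidden prefix and the excluded letter are derived from `enzyme` instead of being hard-coded. The following is only a sketch under that assumption: `generate_seq_general` is a hypothetical name, and the loop at the end simply repeats the check many times, since a single successful run does not show that the function is always correct.

```python
from random import choice


def generate_seq_general(length, enzyme, nucleotides="ACGT"):
    """Build the sequence letter by letter, never completing `enzyme`."""
    prefix, last = enzyme[:-1], enzyme[-1]
    # Letters that are safe to add right after the prefix has just appeared.
    allowed_after_prefix = [n for n in nucleotides if n != last]
    list_dna = []
    tail = ""  # the most recent len(prefix) letters
    for _ in range(length):
        if tail == prefix:
            nucleotide = choice(allowed_after_prefix)
        else:
            nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        # Keep only as many trailing letters as the prefix is long.
        tail = (tail + nucleotide)[-len(prefix):] if prefix else ""
    return ''.join(list_dna)


# Repeat the check from the question many times instead of once.
for _ in range(200):
    seq = generate_seq_general(1000, "AACGTT")
    assert len(seq) == 1000 and seq.count("AACGTT") == 0
```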