How can I make this more efficient? Python DNA generator

Time: 2018-11-26 14:56:36

Tags: python performance ram memory-efficient

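I have code that generates a DNA strand, copies that strand many times, and then cuts each line at random points. I need to be able to produce at least 20,000 lines, but that takes 30 minutes. Is there a way to make this code more efficient? Thanks.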

import sys
import numpy as NP
import fileinput
import re
import random

#Generate Random DNA Sequence

def random_dna_sequence(length):
    return ''.join(random.choice('ACTG') for each in range(length))
#DNA sequences with equal base probability

def base_frequency(dna):
    D = {}
    for base in 'ATCG':
        D[base] = dna.count(base)/float(len(dna))
    return D

for each in range(1):
    dna = random_dna_sequence(300)
    f= open("GeneratedDNA.txt", "w+")
    print(dna, file=f)
    f.close()
    f= open("OrigionalStrand.txt", "w+")
    print(dna, file=f)
    f.close()

Value =int(input("Enter How Many Replica Strands You Want to Generate: "))
for x in range(Value):
    with open("GeneratedDNA.txt") as f_in, open("GeneratedDNA.txt", "a") as f_out :
        for row in f_in.readlines()[-1:] :
            f_out.write(row)
            f_out.close()

min_no_space = 55 #minimum length without spaces
max_no_space = 75 # max sequence length without space
no_space = 0
with open("GeneratedDNA.txt","r") as f, \
        open("GeneratedShortReads.txt","w") as w:
    for line in f:
        for c in line:
            w.write(c)
            if no_space > min_no_space:
                if random.randint(1,9) == 1 or no_space >= max_no_space:
                    w.write("\n")
                    no_space = 0
            else:
                no_space += 1
    f.close()
    w.close()

2 Answers:

Answer 0 (score: 0)

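  1. Do not open and close files inside a loop. Instead, load the file's data into a variable at the start of the program, build the output in another variable, and write it to the file once at the end.
  2. Generating random data is usually expensive. You can draw random numbers in batches (for example 1000 at a time) and consume them from that pool instead of calling the generator for every value.
  3. Use PyPy as the interpreter; it can be around 6 times faster than CPython on loop-heavy, pure-Python code: https://pypy.org/
  4. If that is still not enough, use a language that is faster than Python. I suggest Golang or C++: https://dev.to/albertdugba/go-or-python-and-why-58ob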
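
The first two suggestions above (keep everything in memory, batch the random draws) could look roughly like the sketch below. It is not the asker's exact algorithm: the function names `cut_strand` and `make_short_reads` are hypothetical, and it assumes each piece length is drawn uniformly between 55 and 75 bases, in place of the per-character random test in the question.

```python
import numpy as np

def cut_strand(strand, cut_lengths):
    """Split strand into consecutive pieces whose lengths come from cut_lengths."""
    pieces = []
    pos = 0
    for n in cut_lengths:
        if pos >= len(strand):
            break
        pieces.append(strand[pos:pos + n])
        pos += n
    return pieces

def make_short_reads(strand, copies, min_len=55, max_len=75, seed=None):
    """Cut `copies` replicas of strand at random points, entirely in memory."""
    rng = np.random.default_rng(seed)
    # Enough piece lengths to cover the strand even if every piece is min_len long.
    per_copy = -(-len(strand) // min_len)  # ceiling division
    # One batched draw for all replicas (assumption: uniform piece lengths).
    lengths = rng.integers(min_len, max_len + 1, size=(copies, per_copy))
    reads = []
    for row in lengths:
        reads.extend(cut_strand(strand, row.tolist()))
    return reads
```

The returned list can then be written with a single open and a single write, for example `out.write("\n".join(reads) + "\n")`. The last piece of each replica may be shorter than `min_len`, because it is whatever remains of the strand.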

Answer 1 (score: 0)

If you just want to generate short reads from a DNA sequence (for example Illumina-style reads), try this; it is much faster than your code:

import random
import numpy as np

def random_dna_sequence(length):
    return ''.join(random.choice('ACTG') for each in range(length))
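
The generator above calls `random.choice` once per base. As an optional variation that is not part of the original answer, the standard library's `random.choices` (Python 3.6+) can draw all bases in one call, which is likely to be faster for long sequences. The name `random_dna_sequence_fast` is hypothetical.

```python
import random

def random_dna_sequence_fast(length, alphabet="ACGT"):
    # Draw all `length` bases in a single call, then join them into one string.
    return "".join(random.choices(alphabet, k=length))
```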

We will start from a random DNA sequence 500,000 bp long. From it we will draw 20,000 short reads with a mean length of 60 bp and a standard deviation of 10 bp:

seq_len = 500000
mean_read_length = 60
read_length_sd = 10
num_reads = 20000

my_dna = random_dna_sequence(seq_len)

# Generate random read lengths
read_lengths = [int(x) for x in np.random.normal(mean_read_length,read_length_sd,num_reads)]

# Generate random offsets
offsets = np.random.randint(0,seq_len,num_reads)

# Make the reads
reads = [my_dna[offset:offset+length] for offset,length in zip(offsets,read_lengths)]

# Add code to write reads to file ...
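
One possible way to fill in the file-writing step is sketched below. The helper name `write_reads` is hypothetical, and the one-read-per-line layout is an assumption based on the question's output. Empty reads are skipped, because a normally distributed length can occasionally round to zero.

```python
def write_reads(reads, path):
    """Write one read per line, skipping empty reads; return how many were written."""
    kept = [r for r in reads if r]
    # Open the file once and write every read in a single pass.
    with open(path, "w") as out:
        out.writelines(r + "\n" for r in kept)
    return len(kept)
```

With the variables from the answer, this would be called as `write_reads(reads, "GeneratedShortReads.txt")`.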