How do I split a FASTA file into multiple exon and CDS files?

Asked: 2016-09-27 20:47:44

Tags: python string function file loops

I have a GFF3 file and a FASTA genome file.

The GFF3 file looks like this:

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah

20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah

20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah

20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC
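The gene name, exon number, chromosome, and coordinates can all be read from one such line. A minimal sketch in Python 3, assuming whitespace-separated columns with no spaces inside the attribute column and an ID of the form `<gene>_exon:<n>` as in the sample above; `parse_exon_line` is a hypothetical helper name, not part of any library.

```python
def parse_exon_line(line):
    """Return (gene, exon_number, chromosome, start, end) for one GFF3 exon line."""
    cols = line.split()
    chromosome, start, end = cols[0], int(cols[3]), int(cols[4])
    # Column 9 holds key=value pairs separated by ';'.
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    # Assumption: the ID looks like "<gene>_exon:<n>", as in the sample lines.
    gene, exon_number = attrs["ID"].rsplit("_exon:", 1)
    return gene, int(exon_number), chromosome, start, end


sample = "20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah"
print(parse_exon_line(sample))
```

Splitting on `_exon:` rather than taking the last character of the ID keeps exon numbers above 9 intact.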

For each exon, I need to write a new file named gene_name_exon_number.fa

The header in the file is:

  

gene_name exon_number chromosome_name exon_start exon_end

followed by the nucleotide bases of that chromosome between the start and end positions.

So the first exon file would look like:

  

Blah 1 20 12005 12107   ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Each exon goes into its own new file.
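Writing one exon file comes down to a single slice once the chromosome sequence is held in memory. A rough sketch in Python 3, assuming the genome is already loaded into a dict that maps chromosome name to one unbroken sequence string; `write_exon_file` and the leading `>` in the header are illustrative choices, and strand handling (reverse-complementing `-` features) is deliberately left out.

```python
import os
import tempfile


def write_exon_file(genome, gene, exon_number, chromosome, start, end, out_dir):
    """Write <gene>_<exon_number>.fa into out_dir and return its path."""
    # GFF3 coordinates are 1-based and inclusive; Python slices are
    # 0-based and half-open, hence start - 1.
    sequence = genome[chromosome][start - 1:end]
    path = os.path.join(out_dir, "{}_{}.fa".format(gene, exon_number))
    with open(path, "w") as handle:
        handle.write(">{} {} {} {} {}\n".format(gene, exon_number, chromosome, start, end))
        handle.write(sequence + "\n")
    return path


# Toy genome: one "chromosome" of 10 bases.
genome = {"20": "ACGTACGTAC"}
out_dir = tempfile.mkdtemp()
path = write_exon_file(genome, "Blah", 1, "20", 3, 6, out_dir)
with open(path) as handle:
    exon_text = handle.read()
```

Slicing a joined string avoids any arithmetic on 60-character line widths, since the line breaks are gone before the coordinates are applied.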

In addition, I need to create one CDS file per gene, named gene_name_cds.fa, that contains all of that gene's exons.

The header in the file is:

  

gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length

So the first CDS file looks like this:

  

Blah 4 20 12005 12542   ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT   ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Note: the nucleotide bases are simply the bases of all the exons pasted together.
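If all exons of a gene are collected first, the CDS header can be computed before anything is written, so no line has to be prepended afterwards. A minimal sketch in Python 3; `build_cds` is a hypothetical helper, the genome is assumed to be a dict of chromosome name to sequence string, and `cds_total_length` is taken here to mean the length of the concatenated exons, which is an interpretation rather than something stated above.

```python
def build_cds(gene, chromosome, exons, genome):
    """Build the CDS header and sequence for one gene.

    exons is a list of (start, end) pairs, 1-based and inclusive.
    """
    ordered = sorted(exons)
    # Paste the exon sequences together in coordinate order.
    sequence = "".join(genome[chromosome][s - 1:e] for s, e in ordered)
    cds_start = ordered[0][0]
    cds_end = max(e for _, e in ordered)
    header = "{} {} {} {} {} {}".format(
        gene, len(ordered), chromosome, cds_start, cds_end, len(sequence))
    return header, sequence


genome = {"20": "AAAACCCCGGGGTTTT"}
header, sequence = build_cds("Blah", "20", [(9, 12), (1, 4)], genome)
print(header)
print(sequence)
```

Sorting by start position is a choice made for this sketch; keeping the order of the GFF3 lines instead only requires dropping the `sorted` call and taking the minimum start explicitly.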

In the end, for a GFF3 file containing hundreds of genes and thousands of exons, there will be several exon files and one CDS file for each gene.

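However, the CDS files being generated are in the gigabyte range, even though the whole genome file is only 130 MB. So the problem must be the loop order. Can someone help me fix the loop order?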

I tried using several for loops and a line prepender. So far my code looks like this:

import fileinput

input_file = "dummy.gf3"
search_file = "dummy.fa"

def line_pre_adder(filename, line_to_prepend):
    f = fileinput.input(filename, inplace=1)
    for xline in f:
        if f.isfirstline():
            print line_to_prepend.rstrip('\r\n') + '\n' + xline,
        else:
           print xline,

def exon_extractor():

exon_file = open(input_file)
for exon_line in exon_file:
    previous = ""
    CDS_exon_number = 1
    CDS_start = int(0)
    CDS_end = int(0)
    NMID = exon_line.split()[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
    exon_number = str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
    filename = str(exon_line.split()[-1].split()[0].rsplit(';')[0][3:-2]) + "_" + str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
    chromosome = exon_line.split()[0]
    start = exon_line.split()[3]
    end = str(int(exon_line.split()[4])+1)
    exon_header = ">" + NMID + " " + exon_number + " " + start + " " + str(int(end)-1)
    exon_file = open (str("exon_folder" + filename + ".fas"),"w+")
    cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
    exon_file.write(exon_header + '\n')
    desired = int(end) - int(start)
    with open (search_file,'r') as genome_file:         
        for genome_line in genome_file:
            if str(chromosome + " ") in genome_line:
                for genome_line in genome_file:
                    for i in range (int(desired)/60):
                        exon_file.write(genome_line.rstrip())

                        if NMID != previous:
                            cds_file.close()
                            line_pre_adder(str("test_folder" + NMID +"_CDS" + ".fas"), str(NMID + " " + str(CDS_exon_number) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(CDS_end-CDS_start)))
                            cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
                            cds_file.write(genome_line.rstrip())
                            previous = NMID
                            CDS_exon_number = exon_number
                            CDS_start += int(start)
                            CDS_end += (int(end)-1)
                        else:
                            cds_file.write(genome_line.rstrip())
                            CDS_exon_number = exon_number
                            CDS_start += int(start)
                            CDS_end += (int(end)-1)

                    if (int(desired) % 60) != 0:
                        exon_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                        cds_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                        break

                    else:
                        pass
            else:
                pass
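In the code above, the genome file is reopened for every exon, and the inner `for i in range(...)` loop appears to write the same genome line `desired/60` times for each line that follows the chromosome header, which would explain output far larger than the genome. One possible restructuring, sketched in Python 3 under stated assumptions: read the FASTA once into a dict, then group the exon lines by gene. `read_fasta` and `group_exons` are hypothetical names, the chromosome name is assumed to be the first word after `>`, and `Parent=<gene>` is assumed to identify the gene.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Return {sequence_name: sequence}, reading the input exactly once."""
    genome, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            # Assumption: the chromosome name is the first word after '>'.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def group_exons(handle):
    """Return {gene: [(chromosome, start, end), ...]} from GFF3 exon lines."""
    genes = defaultdict(list)
    for line in handle:
        cols = line.split()
        if len(cols) < 9 or cols[2] != "exon":
            continue  # skip comments, blank lines, and non-exon features
        attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
        # Assumption: Parent=<gene> names the gene, as in the sample lines.
        genes[attrs["Parent"]].append((cols[0], int(cols[3]), int(cols[4])))
    return genes


# In-memory stand-ins for the two input files.
fasta = io.StringIO(">20 test\nACGT\nACGT\n>21\nTTTT\n")
gff = io.StringIO(
    "20 p exon 1 4 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah\n"
    "20 p exon 5 8 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah\n"
)
genome = read_fasta(fasta)
genes = group_exons(gff)
cds = {gene: "".join(genome[c][s - 1:e] for c, s, e in exons)
       for gene, exons in genes.items()}
```

With this order each input file is read once, and every exon contributes exactly `end - start + 1` bases to the output, so the total written can never exceed the summed exon lengths. A 130 MB genome fits in memory on most machines, though that is an assumption about the environment.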

0 answers:

No answers yet