I have a GFF3 file and a FASTA genome file.
The GFF3 file looks like this:
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah
20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah
20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah
20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC
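For reference, one of these lines could be split into the fields needed later roughly as follows. This is a minimal Python 3 sketch, not the code from the question: `parse_exon_line` is a hypothetical helper, and it assumes that the attribute column contains no spaces, that `Parent` holds the gene name, and that the `ID` always ends in `:<exon number>` as in the sample.

```python
def parse_exon_line(line):
    """Split one GFF3 exon line into the fields needed for the output files."""
    fields = line.split()
    chromosome = fields[0]
    start, end = int(fields[3]), int(fields[4])
    # The ninth column holds key=value pairs separated by semicolons.
    attributes = dict(pair.split("=", 1) for pair in fields[8].split(";") if pair)
    # Assumption: the Parent attribute is the gene name.
    gene_name = attributes["Parent"]
    # Assumption: the ID ends in ":<exon number>", as in the sample lines.
    exon_number = int(attributes["ID"].rsplit(":", 1)[1])
    return gene_name, exon_number, chromosome, start, end


line = "20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah"
print(parse_exon_line(line))  # ('Blah', 1, '20', 12005, 12107)
```

Taking the exon number from everything after the last colon, rather than from the last character of the ID, keeps exon numbers of 10 and above intact.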
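For each exon, I need to write a new file named gene_name_exon_number.fa.
The header in the file is:
gene_name exon_number chromosome_name exon_start exon_end
followed by the nucleotide bases between the start and end positions on that chromosome.
So the first exon file would look like this:
Blah 1 20 12005 12107
ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Each exon goes into its own new file.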
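The lookup of an exon's bases can be sketched as below, assuming the whole genome fits in memory as a dictionary keyed by chromosome name. This is Python 3, and `parse_fasta` and `exon_sequence` are hypothetical names. It also assumes that the chromosome name is the first word of each FASTA header and that the start and end positions are 1-based and inclusive, as GFF3 defines them.

```python
def parse_fasta(lines):
    """Collect FASTA records from an iterable of lines into {name: sequence}."""
    sequences = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Assumption: the chromosome name is the first word after ">".
            name = line[1:].split()[0]
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def exon_sequence(genome, chromosome, start, end):
    """Return the bases from start to end, both 1-based and inclusive."""
    return genome[chromosome][start - 1:end]


# Tiny made-up genome, only to show the coordinate handling.
demo = [">20 example", "ACGTAC", "GTTTGA", ">21", "CCCC"]
genome = parse_fasta(demo)
print(exon_sequence(genome, "20", 5, 8))  # ACGT
```

With a real file, the same function could be fed an open handle, for example `with open("dummy.fa") as handle: genome = parse_fasta(handle)`, so the genome is read once rather than once per exon.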
In addition, for each gene I need to create a CDS file named gene_name_cds.fa that contains all of its exons.
The header in the file is:
gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length
So the first CDS file would look like this:
Blah 4 20 12005 12542
ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Note: the nucleotide bases are simply the bases of all the exons pasted together.
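The CDS header and sequence described here could be computed per gene along these lines. It is a Python 3 sketch under stated assumptions: `build_cds` is a hypothetical helper, the exons arrive as already parsed tuples, the genome is a dictionary of chromosome name to sequence, cds_start and cds_end are the smallest start and largest end among the gene's exons (which matches 12005 and 12542 in the example), and cds_total_length is taken to be the length of the pasted sequence.

```python
def build_cds(exons, genome):
    """Return {gene_name: (header, sequence)} from parsed exon tuples.

    Each exon is (gene_name, exon_number, chromosome, start, end),
    with start and end 1-based and inclusive.
    """
    by_gene = {}
    for exon in exons:
        by_gene.setdefault(exon[0], []).append(exon)

    records = {}
    for gene, gene_exons in by_gene.items():
        # Paste the exons together in exon-number order.
        gene_exons.sort(key=lambda e: e[1])
        chromosome = gene_exons[0][2]
        sequence = "".join(genome[chromosome][start - 1:end]
                           for _, _, _, start, end in gene_exons)
        cds_start = min(e[3] for e in gene_exons)
        cds_end = max(e[4] for e in gene_exons)
        # Assumption: cds_total_length is the length of the pasted sequence.
        header = "{} {} {} {} {} {}".format(
            gene, len(gene_exons), chromosome, cds_start, cds_end, len(sequence))
        records[gene] = (header, sequence)
    return records


# Made-up data, only to show the shape of the result.
demo_genome = {"20": "ACGTACGTAC"}
demo_exons = [("Blah", 2, "20", 5, 6), ("Blah", 1, "20", 1, 3), ("ABC", 1, "20", 8, 10)]
for header, sequence in build_cds(demo_exons, demo_genome).values():
    print(header)
    print(sequence)
```

Because the exon count, start, and end are only known once every exon of a gene has been seen, building the header after grouping means it can be written first, with no need to prepend it to a file afterwards.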
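In the end, for a GFF3 file with hundreds of genes and thousands of exons, each gene will have several exon files and one CDS file.
However, although the whole genome file is only 130 MB, the CDS files being generated are in the gigabyte range, so the problem must be the loop order. Can someone help me fix the loop order?
I tried using several for loops and a prepender. So far, my code looks like this: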
import fileinput

input_file = "dummy.gf3"
search_file = "dummy.fa"


def line_pre_adder(filename, line_to_prepend):
    f = fileinput.input(filename, inplace=1)
    for xline in f:
        if f.isfirstline():
            print line_to_prepend.rstrip('\r\n') + '\n' + xline,
        else:
            print xline,

def exon_extractor():
    exon_file = open(input_file)
    for exon_line in exon_file:
        previous = ""
        CDS_exon_number = 1
        CDS_start = int(0)
        CDS_end = int(0)
        NMID = exon_line.split()[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
        exon_number = str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
        filename = str(exon_line.split()[-1].split()[0].rsplit(';')[0][3:-2]) + "_" + str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
        chromosome = exon_line.split()[0]
        start = exon_line.split()[3]
        end = str(int(exon_line.split()[4])+1)
        exon_header = ">" + NMID + " " + exon_number + " " + start + " " + str(int(end)-1)
        exon_file = open (str("exon_folder" + filename + ".fas"),"w+")
        cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
        exon_file.write(exon_header + '\n')
        desired = int(end) - int(start)
        with open (search_file,'r') as genome_file:
            for genome_line in genome_file:
                if str(chromosome + " ") in genome_line:
                    for genome_line in genome_file:
                        for i in range (int(desired)/60):
                            exon_file.write(genome_line.rstrip())
                            if NMID != previous:
                                cds_file.close()
                                line_pre_adder(str("test_folder" + NMID +"_CDS" + ".fas"), str(NMID + " " + str(CDS_exon_number) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(CDS_end-CDS_start)))
                                cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
                                cds_file.write(genome_line.rstrip())
                                previous = NMID
                                CDS_exon_number = exon_number
                                CDS_start += int(start)
                                CDS_end += (int(end)-1)
                            else:
                                cds_file.write(genome_line.rstrip())
                                CDS_exon_number = exon_number
                                CDS_start += int(start)
                                CDS_end += (int(end)-1)
                        if (int(desired) % 60) != 0:
                            exon_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                            cds_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                            break
                        else:
                            pass
                else:
                    pass
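One pattern that may avoid both the prepend step and output files that keep growing is to collect each header and sequence in memory first and then write every file exactly once. The sketch below is Python 3, not a fix of the code above. `write_records` and the `records` mapping of filename to (header, sequence) are hypothetical, and it assumes all records fit in memory, which seems plausible for a 130 MB genome.

```python
import os


def write_records(records, out_dir):
    """Write each (header, sequence) pair to its own file, exactly once."""
    os.makedirs(out_dir, exist_ok=True)
    for filename, (header, sequence) in records.items():
        path = os.path.join(out_dir, filename)
        # "w" truncates earlier output, so a rerun cannot grow the file.
        with open(path, "w") as handle:
            handle.write(header + "\n")
            handle.write(sequence + "\n")
```

A call might look like `write_records({"Blah_1.fa": ("Blah 1 20 12005 12107", "ACGG")}, "exon_folder")`, where the sequence shown is a placeholder. Two design choices matter here: files opened with `a+` keep whatever an earlier run left in them, whereas `w` starts fresh, and `os.path.join` inserts the path separator that plain string concatenation such as `"exon_folder" + filename` leaves out.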