我有一个gff文件,如下所示:
contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;
contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci exon 452592 453069 . - . ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci exon 452153 452543 . - . ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci exon 452592 452691 . - . ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci exon 452729 453069 . - . ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###
我希望从0001开始重命名ID名称,这样对于上述基因,条目是:
contig1 loci gene 452050 453069 15 - . ID=dd_0001;
contig1 loci mRNA 452050 453069 14 - . ID=dd_0001.1;Parent=dd_0001
contig1 loci exon 452050 452543 . - . ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci exon 452592 453069 . - . ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_0001.2;Parent=dd_g4_1G94
contig1 loci exon 452153 452543 . - . ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci exon 452592 452691 . - . ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci exon 452729 453069 . - . ID=dd_0001.2.exon3;Parent=dd_0001.2
上述例子仅用于一个基因录入,但我希望从ID = dd_0001开始连续重命名所有基因及其相应的mRNA /外显子。 任何关于如何做到这一点的提示将非常感激。
答案 0 :(得分:1)
需要打开文件,然后逐行替换id 以下是file I/O和str.replace()的文档参考。
gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'
lines = []
with open(gff_filename, 'r') as gff_file:
for line in gff_file:
line = line.replace(replace_string, replace_with)
lines.append(line)
with open(gff_filename, 'w') as gff_file:
gff_file.writelines(lines)
在Windows 10,Python 3.5.1中测试过,这是可行的。
要搜索ID,您应该使用regex。
import re
gff_filename = 'filename.gff'
replace_with = 'dd_{}'
re_pattern = r'ID=(.*?)[;\.]'
ids = []
lines = []
with open(gff_filename, 'r') as gff_file:
file_lines = [line for line in gff_file]
for line in file_lines:
matches = re.findall(re_pattern, line)
for found_id in matches:
if found_id not in ids:
ids.append(found_id)
for line in file_lines:
for ID in ids:
if ID in line:
id_suffix = str(ids.index(ID)).zfill(4)
line = line.replace(ID, replace_with.format(id_suffix))
lines.append(line)
with open(gff_filename, 'w') as gff_file:
gff_file.writelines(lines)
还有其他方法可以做到这一点,但这非常强大。