Question

I have a GFF file that looks like this:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###

I would like to rename the IDs, numbering them from 0001, so that the entries for the gene above become:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_0001
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2

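The example above covers only one gene entry, but I want to rename all genes and their corresponding mRNAs and exons consecutively, starting from ID=dd_0001. Any hints on how to do this would be much appreciated.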

Answer 1

You need to open the file and replace the ID line by line. See the Python documentation on file I/O and str.replace() for reference.

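gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'   # old gene ID
replace_with = 'dd_0001'        # new gene ID

# Read every line, replacing the old ID wherever it occurs
# (this also updates the Parent= attributes).
lines = []
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

# Overwrite the original file with the renamed lines.
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)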

Tested on Windows 10 with Python 3.5.1; it works.
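
As a quick illustration of why a plain replacement is enough for a single gene: str.replace() rewrites every occurrence of the substring in a line, so the ID= and Parent= attributes stay consistent. A minimal sketch using one mRNA line from the question, assuming tab-separated columns (the variable names are only for illustration):

```python
# One mRNA line from the question, written with tab-separated columns.
line = "contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n"

# str.replace() substitutes every occurrence, so both attributes change together.
renamed = line.replace("dd_g4_1G94", "dd_0001")
print(renamed, end="")
```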

To find the IDs automatically instead of hard-coding them, you should use a regular expression.

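import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the gene part of each ID: the text after "ID=" up to the first ';' or '.'
re_pattern = r'ID=(.*?)[;.]'

ids = []    # unique gene IDs, in order of first appearance
lines = []  # rewritten lines

with open(gff_filename, 'r') as gff_file:
    file_lines = list(gff_file)

# First pass: collect every distinct gene ID.
for line in file_lines:
    for found_id in re.findall(re_pattern, line):
        if found_id not in ids:
            ids.append(found_id)

# Second pass: replace each old ID with its new, zero-padded name.
# Note: plain substring replacement assumes no gene ID is a prefix of another.
for line in file_lines:
    for ID in ids:
        if ID in line:
            # +1 so that numbering starts at 0001 rather than 0000
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

# Overwrite the original file with the renamed entries.
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)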

There are other ways to do this, but this approach is quite robust.
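
One alternative to the two-pass approach above, sketched here under assumptions rather than taken from the answer: substitute only the values that follow ID= or Parent=, and number each gene the first time its base ID is seen. This avoids accidental substring matches when one gene ID is a prefix of another. The function name rename_gff_ids, the prefix and width arguments, and the assumption that gene IDs contain no '.', ';' or whitespace are hypothetical choices for illustration.

```python
import re

# Matches the gene part of an ID= or Parent= value, e.g. "dd_g4_1G94" in
# "ID=dd_g4_1G94.1.exon1". Assumes gene IDs contain no '.', ';' or whitespace.
ATTR_PATTERN = re.compile(r'\b(ID|Parent)=([^.;\s]+)')


def rename_gff_ids(lines, prefix='dd_', width=4):
    """Return new lines with gene IDs renumbered from 1 in order of first appearance."""
    new_names = {}  # old gene ID -> new gene ID

    def substitute(match):
        key, old_id = match.group(1), match.group(2)
        if old_id not in new_names:
            new_names[old_id] = prefix + str(len(new_names) + 1).zfill(width)
        return '{}={}'.format(key, new_names[old_id])

    renamed = []
    for line in lines:
        # Leave comment and directive lines (such as '###') untouched.
        if line.startswith('#'):
            renamed.append(line)
        else:
            renamed.append(ATTR_PATTERN.sub(substitute, line))
    return renamed


example = [
    'contig1\tloci\tgene\t452050\t453069\t15\t-\t.\tID=dd_g4_1G94;\n',
    'contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n',
    'contig1\tloci\texon\t452050\t452543\t.\t-\t.\tID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n',
    '###\n',
]
for out_line in rename_gff_ids(example):
    print(out_line, end='')
```

To apply this to a real file, read its lines with open(), pass them to the function, and write the result to a new file so that the original is preserved.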
