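Basically, a GenBank file contains gene entries ('gene'), each followed by the corresponding 'CDS' entry (only one per gene), like the two I show below. I want to get locus_tag vs. product in a tab-delimited two-column file. 'gene' and 'CDS' are always preceded and followed by whitespace.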
A previous question suggested a script.
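The problem is that, apparently because 'product' sometimes has a '/' character in its value, it conflicts with this script, which, as far as I understand, uses '/' as the field separator to store the information in an array?
I would like to solve this, either by modifying this script or by building another one.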
perl -nE'
BEGIN{ ($/, $") = ("CDS", "\t") }
say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file
gene complement(8972..9094)
/locus_tag="HAPS_0004"
/db_xref="GeneID:7278619"
CDS complement(8972..9094)
/locus_tag="HAPS_0004"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="YP_002474657.1"
/db_xref="GI:219870282"
/db_xref="GeneID:7278619"
/translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
gene 68..637
/locus_tag="HPNK_00040"
CDS 68..637
/locus_tag="HPNK_00040"
/codon_start=1
/transl_table=11
/product="NinG recombination protein/bacteriophage lambda
NinG family protein"
/protein_id="CRESA:HPNK_00040"
/translation="MIKPKVKKRKCKCCGGEFKSADSFRKWCSAECGVKLAKIAQEKA
RQKAIEKRNREERAKIKATRERLKSRSEWLKDAQAIFNEYIRLRDKDEPCISCRRFHQ
GQYHAGHYRTVKAMPELRFNEDNVHKQCSACNNHLSGNITEYRINLVRKIGAERVEAL
ESYHPPVKWSVEDCKEIIKTYRAKIKELK"
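Answer 0 (score: 2)
Since your sample GenBank file is incomplete, I went online to look for a sample file to use in an example, and I found this file.
Using this code and the Bio::GenBankParser module, I parsed it, guessing at which parts of the structure you are after: in this case, the "features" that contain both a locus_tag field and a product field.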
use strict;
use warnings;
use feature 'say';
use Bio::GenBankParser;

my $file   = shift;
my $parser = Bio::GenBankParser->new( file => $file );

while ( my $seq = $parser->next_seq ) {
    my $feat = $seq->{'FEATURES'};
    for my $f (@$feat) {
        my $tag  = $f->{'feature'}{'locus_tag'};
        my $prod = $f->{'feature'}{'product'};
        if (defined $tag and defined $prod) {
            say join "\t", $tag, $prod;
        }
    }
}
Usage:
perl script.pl input.txt > output.txt
Output:
MG_001 DNA polymerase III, beta subunit
MG_470 CobQ/CobB/MinD/ParA nucleotide binding domain-containing protein
The output of the one-liner for the same input would be:
MG_001 DNA polymerase III, beta subunit
MG_470 CobQ/CobB/MinD/ParA nucleotide binding
domain-containing protein
That is, of course, assuming you add the /s modifier to the regex to account for multi-line entries (as leeduhem pointed out in the comments):
m!/(?:locus_tag|product)="(.+?)"!sg
# ^---- this
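For comparison, here is a rough Python sketch of the same regex idea. In `m!...!` the `!` characters are the regex delimiters, so a '/' inside a product value is matched like any other character. What breaks the match is the line break, which `.` does not cross without /s (`re.DOTALL` in Python). The names `extract_pairs`, `QUALIFIER` and `CDS_START` are made up for this illustration, and the sketch assumes every CDS feature key starts its own line, as in the sample from the question. Unlike the one-liner, it also collapses the line wrap inside a value into a single space.

```python
import re

# Matches /locus_tag="..." or /product="..."; re.DOTALL lets "." cross line
# breaks, which is what the /s modifier does in Perl.
QUALIFIER = re.compile(r'/(locus_tag|product)="(.+?)"', re.DOTALL)
# A CDS feature key at the start of a line (leading indentation optional).
CDS_START = re.compile(r'^[ \t]*CDS[ \t]+', re.MULTILINE)


def extract_pairs(text):
    """Return a list of (locus_tag, product) tuples, one per CDS block."""
    pairs = []
    # Everything before the first CDS key belongs to a gene entry; skip it.
    for block in CDS_START.split(text)[1:]:
        found = {}
        for name, value in QUALIFIER.findall(block):
            # Keep the first hit only: the block also runs into the next gene.
            # Collapse the line wrap inside a value into a single space.
            found.setdefault(name, " ".join(value.split()))
        if "locus_tag" in found and "product" in found:
            pairs.append((found["locus_tag"], found["product"]))
    return pairs


# Shortened copy of the second entry from the question.
sample = '''gene 68..637
/locus_tag="HPNK_00040"
CDS 68..637
/locus_tag="HPNK_00040"
/codon_start=1
/product="NinG recombination protein/bacteriophage lambda
NinG family protein"
/protein_id="CRESA:HPNK_00040"
'''

for locus_tag, product in extract_pairs(sample):
    print(locus_tag + "\t" + product)
```

For anything beyond a quick extraction, a real GenBank parser is the safer choice, since a regex like this knows nothing about the rest of the file format.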
Answer 1 (score: 1)
Having read the duplicate question at http://www.biostars.org/p/94164/ (please don't cross-post like this), here is a minimal Biopython answer:
import sys
from Bio import SeqIO

filename = sys.argv[1]  # Takes first command line argument input filename
for record in SeqIO.parse(filename, "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            locus_tag = feature.qualifiers.get("locus_tag", ["???"])[0]
            product = feature.qualifiers.get("product", ["???"])[0]
            print("%s\t%s" % (locus_tag, product))
With minor changes, you could write it to a file instead.
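One way those minor changes might look, as a sketch rather than the only option: the loop becomes a generator and the rows go to an open file handle instead of standard output. The helper names `write_table` and `iter_cds_pairs`, and the file names in the comment, are hypothetical. The `SeqIO.parse` call and the qualifier lookups are the same as in the script above.

```python
import io


def write_table(pairs, handle):
    """Write (locus_tag, product) pairs to an open text handle, tab-separated."""
    for locus_tag, product in pairs:
        handle.write("%s\t%s\n" % (locus_tag, product))


def iter_cds_pairs(filename):
    """Yield (locus_tag, product) for every CDS feature in a GenBank file."""
    # Imported here so that write_table can be tried without Biopython installed.
    from Bio import SeqIO

    for record in SeqIO.parse(filename, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                locus_tag = feature.qualifiers.get("locus_tag", ["???"])[0]
                product = feature.qualifiers.get("product", ["???"])[0]
                yield locus_tag, product


# In-memory demonstration. For a real run (hypothetical file names):
#     with open("output.tsv", "w") as out:
#         write_table(iter_cds_pairs("input.gbk"), out)
buffer = io.StringIO()
write_table([("MG_001", "DNA polymerase III, beta subunit")], buffer)
print(buffer.getvalue(), end="")
```

Redirecting the original script's output with `>` on the command line gives the same result without any code changes. Writing the file from Python is mainly useful when the table is one step in a larger script.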