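I am trying to parse a GBK file. Basically, I need to return the locus tag and product name of every gene that matches a pattern. So if I wanted to search for all predicted gene products, the search term "predicted" would return: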
/product="predicted semialdehyde dehydrogenase"
/locus_tag="ECDH10B_2481"
I have been able to return the /product line, but I can't figure out how to parse "backwards" to grab the /locus_tag.
Here is what I have so far:
my $fasta_file = 'example.txt';
open(INPUT, $fasta_file) || die "ERROR: can't read input FASTA file: $!";
while ( <INPUT> ) {
    if (/predicted/) {
        print $_;
    }
}
Contents of example.txt:
gene complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
CDS complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
/codon_start=1
/transl_table=11
/product="predicted semialdehyde dehydrogenase"
/protein_id="ACB03477.1"
/db_xref="GI:169889770"
/db_xref="ASAP:AEC-0002184"
/translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAG
EQL"
gene complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
CDS complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
/codon_start=1
/transl_table=11
/product="erythronate-4-phosphate dehydrogenase"
/protein_id="ACB03478.1"
/db_xref="GI:169889771"
/db_xref="ASAP:AEC-0002185"
/translation="MKILVDENMPYARDLFSRLGEVTAVPGRPIPVAQLADADALMVR
SVTKVNESLLAGKPIKFVGTATAGTDHVDEAWLKQAGIGFSAAP"
Answer 0 (score: 1)
You shouldn't "parse backwards". Your /locus tag is one event, and the match is another. Your logic should run
Answer 1 (score: 1)
Just remember the last locus tag you encountered, and print it when a line matches "predicted":
#!/usr/bin/perl
use warnings;
use strict;
my $fasta_file = 'example.txt';
open my $INPUT, '<', $fasta_file or die "ERROR: can't read input FASTA file: $!";
my $locus_tag;
while (<$INPUT>) {
    if (/locus_tag/) {
        $locus_tag = $_;
    } elsif (/predicted/) {
        print;
        print $locus_tag;
    }
}
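If only the quoted values are wanted rather than the whole lines, the same "remember the last locus tag" idea can capture them with a regex. This is a minimal sketch, assuming each qualifier sits on a single line as in the example; the helper name find_products and the in-memory sample are hypothetical and only there for illustration:

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: returns [locus_tag, product] pairs for every
# /product line whose quoted value matches $pattern.
sub find_products {
    my ($fh, $pattern) = @_;
    my ($locus_tag, @hits);
    while (my $line = <$fh>) {
        if ($line =~ m{/locus_tag="([^"]*)"}) {
            $locus_tag = $1;    # remember the most recent locus tag
        }
        elsif (my ($product) = $line =~ m{/product="([^"]*)"}) {
            # the product is copied out before the second match runs
            push @hits, [ $locus_tag // 'unknown', $product ]
                if $product =~ $pattern;
        }
    }
    return @hits;
}

# Shortened sample input (an assumption, not the full example.txt)
my $sample = <<'END';
/locus_tag="ECDH10B_2481"
/product="predicted semialdehyde dehydrogenase"
/locus_tag="ECDH10B_2482"
/product="erythronate-4-phosphate dehydrogenase"
END

open my $fh, '<', \$sample or die "Can't open in-memory handle: $!";
for my $hit (find_products($fh, qr/predicted/)) {
    print "$hit->[0]\t$hit->[1]\n";
}
```

With this sample, only the ECDH10B_2481 entry is printed, as a tab-separated locus tag and product name. To read a real file, open example.txt in place of the in-memory handle.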
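Answer 2 (score: 0)
Parsing backwards is hard. You will be better served by parsing each complete entry and then deciding whether it matches. That is a bit more work now, but it will be very useful when you want to do other things with the gene data.
The approach I use below builds up each entry in %entry. When it sees the next "gene" line, it processes the current entry (in this case, checking whether the product matches) and clears it for the next one.
I've used the DATA filehandle for testing purposes; it reads everything after the __DATA__ line.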
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
my %entry;
while (my $line = <DATA>) {
    # new entry: process the previous one and clear it
    if ( $line =~ m{^ \s* gene \s+ complement \( (.*) \) }x ) {
        process_entry(\%entry) if keys %entry;
        %entry = ( complement => $1 );
    }
    elsif ( $line =~ m{^ \s* CDS \s+ }x ) {
        # ignore CDS lines for now
    }
    # qualifier line; leading whitespace is optional so that both
    # indented and unindented input are accepted
    elsif ( $line =~ m{^ \s* / (\w+) = (.*) }x ) {
        $entry{$1} = $2;
    }
    else {
        warn "Unknown line $line";
    }
}

# Process the last one.
process_entry(\%entry) if keys %entry;

sub process_entry {
    my $entry = shift;
    # an entry without a /product qualifier is treated as a non-match
    say "MATCH! $entry->{locus_tag}" if ($entry->{product} // '') =~ /predicted/;
    return;
}
__DATA__
gene complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
CDS complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
/codon_start=1
/transl_table=11
/product="predicted semialdehyde dehydrogenase"
/protein_id="ACB03477.1"
/db_xref="GI:169889770"
/db_xref="ASAP:AEC-0002184"
/translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAGEQL"
gene complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
CDS complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
/codon_start=1
/transl_table=11
/product="erythronate-4-phosphate dehydrogenase"
/protein_id="ACB03478.1"
/db_xref="GI:169889771"
/db_xref="ASAP:AEC-0002185"
/translation="MKILVDENMPYARDLFSRLGEVTAVPGRPIPVAQLADADALMVRSVTKVNESLLAGKPIKFVGTATAGTDHVDEAWLKQAGIGFSAAP"
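In the question's example.txt the /translation value wraps onto a following line, which the loop above would report as an unknown line. Here is a minimal sketch of one way to fold such continuation lines into the preceding qualifier. It assumes a continuation is any line that is neither a "gene" or "CDS" line nor a /qualifier line; the sub name parse_entries and the shortened sample are hypothetical:

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns one hashref per "gene" block, holding
# the qualifiers seen up to the next "gene" line.
sub parse_entries {
    my ($fh) = @_;
    my (@entries, $entry, $last_key);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m{^\s*gene\s+(\S+)}) {
            # a "gene" line starts a new entry
            $entry = { location => $1 };
            push @entries, $entry;
            undef $last_key;
        }
        elsif ($line =~ m{^\s*CDS\s}) {
            undef $last_key;    # feature line, nothing to continue
        }
        elsif ($line =~ m{^\s*/(\w+)=(.*)}) {
            next unless $entry;
            $last_key = $1;
            $entry->{$last_key} = $2;    # repeated keys overwrite earlier ones
        }
        elsif ($entry and defined $last_key) {
            # Continuation of a wrapped value. Joined without a space,
            # which suits sequence data; wrapped free text such as a
            # long /product would need a space instead.
            $line =~ s/^\s+//;
            $entry->{$last_key} .= $line;
        }
    }
    # strip the surrounding double quotes from every value
    for my $e (@entries) {
        s/^"|"$//g for values %$e;
    }
    return @entries;
}

# Shortened sample input (an assumption, not the full example.txt)
my $sample = <<'END';
gene complement(2525423..2526436)
/locus_tag="ECDH10B_2481"
CDS complement(2525423..2526436)
/locus_tag="ECDH10B_2481"
/product="predicted semialdehyde dehydrogenase"
/translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAG
EQL"
END

open my $fh, '<', \$sample or die "Can't open in-memory handle: $!";
for my $e (parse_entries($fh)) {
    next unless ($e->{product} // '') =~ /predicted/;
    say "$e->{locus_tag}\t$e->{translation}";
}
```

Returning a list of hashrefs instead of printing inside the loop keeps the matching step separate from the parsing step, so other filters can reuse the same parsed entries.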
Alternatively, there are several Fasta readers on CPAN, including Bio::SeqReader::Fasta and Bio::DB::Fasta.