Question

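So here is the problem. I am trying to parse an XML file of information from GenBank. The file contains information on multiple DNA sequences. I have already done this for two other XML formats from GenBank (TINY xml and INSD xml), but the pure XML format is giving me a headache.

Here is how my program should work. Download an XML-formatted file containing information on X sequences from GenBank. Run my Perl script, which searches that XML file line by line and prints the information I want to a new file in FASTA format. That is: >Sequence_name_and_information\n sequence\n >sequence_name.... and so on, until you have all the sequences from the XML file.

My problem is that in the pure XML, the sequence itself comes before the identifier for the gene or the locus of the sequence. The gene or locus of the sequence should go on the same line as the ">". Here is the code I have, from opening the file through parsing it: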

open( New_File, "+>$PWD_file/$new_file" ) or die "\n\nCouldn't create file. Check permissions on location.\n\n";

    while ( my $lines = <INSD> ) {
        foreach ($lines) {
            if (m/<INSDSeq_locus>.*<\/INSDSeq_locus>/) {
                $lines =~ s/<INSDSeq_locus>//g and $lines =~ s/<\/INSDSeq_locus>//g and $lines =~ s/[a-z, |]//g; #this last bit may cause a bug of removing the letters in the genbank accession number
                $lines =~ s/ //g;
                chomp($lines);
                print New_File ">$lines\_";
            } elsif (m/<INSDSeq_organism>.*<\/INSDSeq_organism>/) {
                $lines =~ s/<INSDSeq_organism>//g and $lines =~ s/<\/INSDSeq_organism>//g;
                $lines =~ s/(\.|\?|\-| )/_/g;
                $lines =~ s/_{2,}/_/g;
                $lines =~ s/_{1,}$//;
                $lines =~ s/^>*_{1,}//; 
                $lines =~ s/\s{2}//g;
                chomp($lines);
                print New_File "$lines\n";
            } elsif (m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/) {
                $lines =~ s/<INSDSeq_sequence>//g and $lines =~ s/<\/INSDSeq_sequence>//g;
                $lines =~ s/ //g;
                chomp($lines);
                print New_File "$lines\n";
            }
        }
    }
    close INSD;
    close New_File;
}

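There are two places where the gene/locus information can be found: between the LOCUS_NAME tags or between the GENE_NAME tags. It will be one or the other. If one has information, the other will be empty. In either case, the value needs to be added to the end of the >....... line.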

Thanks,

AlphaA

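P.S. I have tried printing that information to a "file" by opening "$NA" with ">", writing the sequence to it, then continuing with the program, finding the gene information, printing it to the > line, then reading the $NA file and printing its contents to the line after the > line. I hope this is clear.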

Answer 1

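Use an XML parser. I'm not a biologist and I'm not sure of the final format you want, but this should be a simple starting point. In the anonymous sub, $_[1] contains a hash reference, and from what I can tell from the above, everything you want to save is available by the time the parent tag of the tags you want is parsed. Printing the elements of $_[1] in the format you want should be easy: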

use strict;
use warnings;

use XML::Rules;
use Data::Dumper;

my @rules = (
  _default => '',
  'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content',
  INSDSeq  => sub { delete $_[1]{_content}; print Dumper $_[1]; return },
);

my $p = XML::Rules->new(rules => \@rules);
$p->parsefile('sequence.gbc.xml');

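That shows how easy it is to print the tags you want. Alternatively, if you want some other tags, this is what I would probably really do (if you just print element by element, you don't need the @tags variable at all):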

my @tags = qw(
  INSDSeq_locus
  INSDSeq_organism
  INSDSeq_sequence
);

my @rules = (
  _default => 'content',
  # Elements are, e.g. $_[1]{INSDSeq_locus}
  INSDSeq  => sub { print "$_: $_[1]{$_}\n" for @tags; return; },
);

Used with:

my $p = XML::Rules->new(rules => \@rules, stripspaces => 4);
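To connect this back to the FASTA output the question asks for: because the handler runs once per record with all child values already collected in the hash, the order of the tags in the file no longer matters. The following is a minimal sketch, using only core Perl, of how that hash might be turned into a FASTA record. The helper name `fasta_record` is hypothetical, and the keys `GENE_NAME` and `LOCUS_NAME` are assumed from the question's description rather than confirmed against a real GenBank file.

```perl
use strict;
use warnings;

# Hypothetical helper: build one FASTA record from a hash reference such as
# the one the INSDSeq handler receives in $_[1]. The key names GENE_NAME and
# LOCUS_NAME are assumptions taken from the question's description.
sub fasta_record {
    my ($rec) = @_;

    # Trim surrounding whitespace, treating a missing field as empty.
    my $clean = sub {
        my $v = defined $_[0] ? $_[0] : '';
        $v =~ s/^\s+|\s+$//g;
        return $v;
    };

    my $locus    = $clean->($rec->{INSDSeq_locus});
    my $organism = $clean->($rec->{INSDSeq_organism});
    my $sequence = $clean->($rec->{INSDSeq_sequence});

    # Only one of the two name fields is expected to be filled in;
    # take the first non-empty one.
    my ($name) = grep { length($_) }
                 map  { $clean->($rec->{$_}) } qw(GENE_NAME LOCUS_NAME);

    # Similar in spirit to the question's substitutions: turn runs of
    # ".", "?", "-" and whitespace into a single underscore.
    $organism =~ s/[.?\-\s]+/_/g;
    $organism =~ s/^_+|_+$//g;

    # Join the non-empty parts into the header line.
    my $header = join '_',
        grep { defined($_) && length($_) } $locus, $organism, $name;

    return ">$header\n$sequence\n";
}

# Example record with made-up values.
print fasta_record({
    INSDSeq_locus    => 'AB000001',
    INSDSeq_organism => 'Homo sapiens',
    INSDSeq_sequence => 'acgtacgt',
    GENE_NAME        => 'COI',
    LOCUS_NAME       => '',
});
```

Inside the rules, the handler could then presumably be written as `INSDSeq => sub { print fasta_record($_[1]); return }`, printing to whichever filehandle holds the output file.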

Answer 2

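In my opinion, you should use XSLT and XPath to navigate to the data you need.

As @Brian suggested, it is easier to use established XML parsing techniques and libraries.

There is even a Perl library for XSLT.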
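As a rough illustration of the XPath half of this suggestion, here is a minimal sketch that uses the XML::LibXML module (a CPAN module, not part of core Perl, and not necessarily the library this answer had in mind) instead of a full XSLT stylesheet. The inline document, its `INSDSet` root element, and the helper name `fasta_from_xml` are assumptions made for the example; the child tag names are the ones used in the question's code.

```perl
use strict;
use warnings;
use XML::LibXML;

# Small inline document standing in for the downloaded GenBank file.
# The root element name and all values are made up for illustration.
my $xml = <<'XML';
<INSDSet>
  <INSDSeq>
    <INSDSeq_locus>AB000001</INSDSeq_locus>
    <INSDSeq_organism>Homo sapiens</INSDSeq_organism>
    <INSDSeq_sequence>acgtacgt</INSDSeq_sequence>
  </INSDSeq>
  <INSDSeq>
    <INSDSeq_sequence>ttgacc</INSDSeq_sequence>
    <INSDSeq_locus>AB000002</INSDSeq_locus>
    <INSDSeq_organism>Mus musculus</INSDSeq_organism>
  </INSDSeq>
</INSDSet>
XML

# Hypothetical helper: return FASTA text for every INSDSeq record.
sub fasta_from_xml {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);

    my $fasta = '';
    for my $seq ($doc->findnodes('//INSDSeq')) {
        # XPath lookups are relative to the record node, so the order of
        # the child elements inside the record does not matter.
        my $locus    = $seq->findvalue('./INSDSeq_locus');
        my $organism = $seq->findvalue('./INSDSeq_organism');
        my $sequence = $seq->findvalue('./INSDSeq_sequence');

        $organism =~ s/\s+/_/g;    # header fields joined by underscores
        $fasta .= ">${locus}_$organism\n$sequence\n";
    }
    return $fasta;
}

print fasta_from_xml($xml);
```

For a file on disk, `XML::LibXML->load_xml(location => $path)` can be used in place of the `string` form. The second record deliberately lists the sequence first to show that the header can still be built before the sequence is printed.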

Parsing an XML file line by line
