Extracting and storing XML data with libXML / XPath

Time: 2014-05-25 03:33:23

Tags: xml perl xpath

use XML::LibXML;
use Data::Dumper; 

#parsing file
my $dom = XML::LibXML->new->parse_file('sample.xml');

my $context = XML::LibXML::XPathContext->new( $dom->documentElement()  );
$context->registerNs('u', 'http://uniprot.org/uniprot');

#print file to make sure it looks ok
print $dom, "\n";

    #finds shortnames
    my $sn = $context->findnodes('//u:shortName');
    print 'ShortName: '.$sn, "\n";

    #finds dbReference ids that are of type EC
    my $ids = $context->findnodes('//u:dbReference[@type="EC"]/@id');   
    my $number =()= $ids =~ /\./gi;
    print 'EC Values: '.$ids, "\n";

    #finds sequences that have a length
    my $seq = $context->findnodes('//u:sequence[@length>1]');
    $seq =~ s/" "/"\n"/;
    print 'Sequence: '.$seq, "\n";

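I currently have this code, which runs on this XML file containing 10 entries (https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml). Right now it extracts the short names, dbReference ids, and sequences of all 10 entries in the file and prints them lumped together. What I want instead is a short name, dbReference, and sequence for each individual entry in the XML file. Is it possible to have the script look up this data one entry at a time? My end goal is to format the values in a specific way for output.

I was thinking of running some code beforehand that extracts only the entries and then passes them to the rest of the code for data extraction.

Thanks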

2 answers:

Answer 0 (score: 2)

You need to query for a node set (this returns a collection):

my @entries = $context->findnodes('//u:entry');

Then, for each node, run a context-relative XPath expression with findnodes(expression, context-node), passing that node as the second argument, for example:

foreach my $entry (@entries) {
    my $entryName  = $context->findnodes('u:name', $entry);
    ...
}

Here is an attempt based on your code:

use strict;
use warnings;
use XML::LibXML;

# Parse the file
my $dom = XML::LibXML->new->parse_file('sample.xml');

my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');

my @entries = $context->findnodes('//u:entry');
foreach my $entry (@entries) {

    # Each query passes $entry as the context node, so it only sees this entry
    my $entryName  = $context->findnodes('u:name', $entry);
    my @shortNames = $context->findnodes('.//u:shortName', $entry);
    my @dbRefs     = $context->findnodes('.//u:dbReference[@type="EC"]/@id', $entry);
    my $sequence   = $context->findnodes('.//u:sequence[@length>1]', $entry);

    print "============================================================\n";
    print "\nName: ".$entryName."\n";

    print "\nShort Names: \n";
    my $i = 0;
    foreach my $shortName (@shortNames) {
        print ++$i.': '.$shortName->textContent, "\n";
    }

    print "\nEC Values: \n";
    $i = 0;
    foreach my $dbRef (@dbRefs) {
        print ++$i.': '.$dbRef->nodeValue, "\n";
    }

    # Presumably the intent was to break the sequence text at spaces; the
    # original pattern contained literal quote characters and never matched
    $sequence =~ s/ /\n/g;
    print "\nSequence: ".$sequence, "\n";
}
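
Since the question's end goal is formatted output, one option is to collect each entry's values into a plain Perl structure first and format them afterwards. The following is only a minimal sketch under that assumption: the inline XML is a made-up two-entry stand-in for Sample.xml, and the `extract_entries` helper and its hash keys are hypothetical names chosen for illustration, not part of XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up miniature document in the UniProt namespace (not the real Sample.xml).
my $xml = <<'XML';
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>ENTRY_ONE</name>
    <protein><recommendedName><shortName>E1</shortName></recommendedName></protein>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="PDB" id="1ABC"/>
    <sequence length="4">MKVL</sequence>
  </entry>
  <entry>
    <name>ENTRY_TWO</name>
    <dbReference type="EC" id="2.7.1.2"/>
    <sequence length="3">MAT</sequence>
  </entry>
</uniprot>
XML

# Hypothetical helper: returns one hash reference per u:entry element.
sub extract_entries {
    my ($dom) = @_;
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs('u', 'http://uniprot.org/uniprot');

    my @records;
    for my $entry ($xpc->findnodes('//u:entry')) {
        # Every expression is evaluated relative to the current entry.
        my @short = map { $_->textContent }
                    $xpc->findnodes('.//u:shortName', $entry);
        my @ec    = map { $_->nodeValue }
                    $xpc->findnodes('.//u:dbReference[@type="EC"]/@id', $entry);
        push @records, {
            name        => $xpc->findvalue('u:name[1]', $entry),
            short_names => \@short,
            ec_ids      => \@ec,
            sequence    => $xpc->findvalue('.//u:sequence[@length>1]', $entry),
        };
    }
    return @records;
}

my $dom     = XML::LibXML->load_xml(string => $xml);
my @records = extract_entries($dom);

# Formatting is now separate from extraction and can be changed freely.
for my $r (@records) {
    printf "%s | short: %s | EC: %s | seq: %s\n",
        $r->{name},
        join(',', @{ $r->{short_names} }) || '-',
        join(',', @{ $r->{ec_ids} }),
        $r->{sequence};
}
```

Keeping the records in an array of hashes means the output format can be changed, or the data dumped with Data::Dumper for inspection, without touching the XPath queries.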

Answer 1 (score: 1)

It looks like //sequence is your main interest, so you only need to iterate over the values returned by findnodes:

for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    # ...
}

Then you just need to pull the other values relative to this node. To find out how to do that, just google XML::LibXML Namespace; the third result is a PerlMonks post: XML::LibXML and namespaces

for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";

    my @sn = $context->findnodes('..//u:shortName', $seq);
    print '  ShortName Count: '.@sn. "\n";

    my @ids = $context->findnodes('..//u:dbReference[@type="EC"]/@id', $seq);   
    print '  EC Values Count: '.@ids. "\n";
}

Output (note that not every seq has a shortName):

Sequence @length: 323
  ShortName Count: 5
  EC Values Count: 7
Sequence @length: 503
  ShortName Count: 0
  EC Values Count: 5
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 4
Sequence @length: 490
  ShortName Count: 0
  EC Values Count: 4
Sequence @length: 490
  ShortName Count: 0
  EC Values Count: 4
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 3
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 3
Sequence @length: 539
  ShortName Count: 2
  EC Values Count: 3
Sequence @length: 494
  ShortName Count: 1
  EC Values Count: 3
Sequence @length: 277
  ShortName Count: 0
  EC Values Count: 3
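
The `..//` prefix used in the loop steps from each `sequence` up to its parent before searching descendants, which gives per-entry results as long as `sequence` is a direct child of `entry`. The sketch below compares that form with an absolute path and a `.//` path. It uses a made-up two-entry document rather than the real Sample.xml, and the `ancestor::` variant is shown only as a possible alternative, not something taken from the answer.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up miniature document (not the real Sample.xml).
my $dom = XML::LibXML->load_xml(string => <<'XML');
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><shortName>A1</shortName><shortName>A2</shortName><sequence length="4">MKVL</sequence></entry>
  <entry><shortName>B1</shortName><sequence length="3">MAT</sequence></entry>
</uniprot>
XML

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('u', 'http://uniprot.org/uniprot');

# Use the first sequence element as the context node for every query.
my ($first_seq) = $xpc->findnodes('//u:sequence');

# Absolute path: starts at the document root, so the context node is ignored.
my @absolute = $xpc->findnodes('//u:shortName', $first_seq);

# Relative to the sequence itself: it has no shortName descendants.
my @below = $xpc->findnodes('.//u:shortName', $first_seq);

# Relative to the parent entry: only that entry's shortNames.
my @via_parent = $xpc->findnodes('..//u:shortName', $first_seq);

# Same idea without assuming sequence is a direct child of entry.
my @via_ancestor = $xpc->findnodes('ancestor::u:entry[1]//u:shortName', $first_seq);

printf "absolute=%d below=%d parent=%d ancestor=%d\n",
    scalar @absolute, scalar @below, scalar @via_parent, scalar @via_ancestor;
```

Only the absolute form mixes values from different entries, which is the behavior the question was trying to get away from.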

For more tips on how to build XPath expressions, see: XPath Examples