use XML::LibXML;
use Data::Dumper;
#parsing file
my $dom = XML::LibXML->new->parse_file('sample.xml');
my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');
#print file to make sure it looks ok
print $dom, "\n";
#finds shortnames
my $sn = $context->findnodes('//u:shortName');
print 'ShortName: '.$sn, "\n";
#finds dbReference ids that are of type EC
my $ids = $context->findnodes('//u:dbReference[@type="EC"]/@id');
my $number =()= $ids =~ /\./gi;
print 'EC Values: '.$ids, "\n";
#finds sequences that have a length
my $seq = $context->findnodes('//u:sequence[@length>1]');
$seq =~ s/" "/"\n"/;
print 'Sequence: '.$seq, "\n";
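I currently have this code, which runs on this XML file containing 10 entries (https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml). Right now it extracts the short names, dbReference ids and sequences of all 10 entries in the XML file and lumps them together when printing. What I want instead is a short name, dbReference and sequence for each entry in the XML file. Is it possible to have the script look up this data one entry at a time? My end goal is to format the values in a specific way for output.
I was thinking of running some code before this that extracts only the entries and then passes them to the rest of the code for data extraction.
Thanks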
Answer 0 (score: 2)
You need to query for the set of entry nodes (this returns a collection):
my @entries = $context->findnodes('//u:entry');
Then, for each node, run a context-relative XPath expression with findnodes(expression, context-node), passing that node as the second argument, for example:
foreach my $entry (@entries) {
    my $entryName = $context->findnodes('u:name', $entry);
    ...
}
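To make the context-node idea concrete, here is a minimal self-contained sketch. The two-entry XML string is made up for illustration and is not the asker's file; only the namespace URI is taken from the question. It assumes findvalue accepts the same optional context node as findnodes, which is how XML::LibXML::XPathContext documents it.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-entry document in the UniProt namespace (illustration only).
my $xml = <<'XML';
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>ENTRY_ONE</name>
    <protein><recommendedName><shortName>E1</shortName></recommendedName></protein>
  </entry>
  <entry>
    <name>ENTRY_TWO</name>
    <protein>
      <recommendedName><shortName>E2a</shortName></recommendedName>
      <alternativeName><shortName>E2b</shortName></alternativeName>
    </protein>
  </entry>
</uniprot>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom->documentElement);
$xpc->registerNs('u', 'http://uniprot.org/uniprot');

my @report;
for my $entry ($xpc->findnodes('//u:entry')) {
    # Both queries are evaluated relative to the current <entry>.
    my $name  = $xpc->findvalue('u:name', $entry);
    my @short = map { $_->textContent } $xpc->findnodes('.//u:shortName', $entry);
    push @report, "$name: " . join(', ', @short);
}
print "$_\n" for @report;
```

Each line of the report contains only the short names that belong to that entry, which is the per-entry grouping the question asks for.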
Here is an attempt using your code:
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;
# Parse the file and register the UniProt namespace under the prefix "u"
my $dom = XML::LibXML->new->parse_file('sample.xml');
my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');
my @entries = $context->findnodes('//u:entry');
foreach my $entry (@entries) {
    # Every query below is relative to $entry, passed as the second argument
    my $entryName  = $context->findvalue('u:name', $entry);
    my @shortNames = $context->findnodes('.//u:shortName', $entry);
    my @dbRefs     = $context->findnodes('.//u:dbReference[@type="EC"]/@id', $entry);
    my $sequence   = $context->findvalue('.//u:sequence[@length>1]', $entry);
    print "============================================================\n";
    print "\nName: $entryName\n";
    print "\nShort Names: \n";
    my $i = 0;
    foreach my $shortName (@shortNames) {
        print ++$i . ': ' . $shortName->textContent, "\n";
    }
    print "\nEC Values: \n";
    $i = 0;
    foreach my $dbRef (@dbRefs) {
        print ++$i . ': ' . $dbRef->nodeValue, "\n";
    }
    # Put each space-separated chunk of the sequence on its own line
    $sequence =~ s/ /\n/g;
    print "\nSequence: $sequence\n";
}
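One detail that is easy to get wrong with per-entry queries: the leading dot and the second argument both matter. The sketch below uses a made-up two-entry document (an assumption for illustration, not the asker's data) to compare an absolute path, a relative path with a context node, and a relative path with no context node.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical data: two entries with one sequence each.
my $dom = XML::LibXML->load_xml(string => <<'XML');
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><sequence length="3">AAA</sequence></entry>
  <entry><sequence length="4">CCCC</sequence></entry>
</uniprot>
XML

my $xpc = XML::LibXML::XPathContext->new($dom->documentElement);
$xpc->registerNs('u', 'http://uniprot.org/uniprot');

my ($first) = $xpc->findnodes('//u:entry');

# Absolute path: starts at the document root even though $first is passed.
my @absolute = $xpc->findnodes('//u:sequence', $first);
# Relative path with a context node: only descendants of $first.
my @relative = $xpc->findnodes('.//u:sequence', $first);
# Relative path without a context node: relative to the node the
# XPathContext was created with (here the document element).
my @no_ctx = $xpc->findnodes('.//u:sequence');

printf "absolute: %d, relative: %d, no context node: %d\n",
    scalar @absolute, scalar @relative, scalar @no_ctx;
```

Only the second form restricts the search to a single entry; the other two see every sequence in the document.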
Answer 1 (score: 1)
It looks like //sequence is your main interest, so you just need to iterate over the values returned by findnodes:
for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    # ...
}
Then you just need to pull the other values relative to this node. To learn how to do that, just google XML::LibXML Namespace; the third result is a perlmonks post: XML::LibXML and namespaces
for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    my @sn = $context->findnodes('..//u:shortName', $seq);
    print ' ShortName Count: '.@sn. "\n";
    my @ids = $context->findnodes('..//u:dbReference[@type="EC"]/@id', $seq);
    print ' EC Values Count: '.@ids. "\n";
}
Output (note that not every seq has a shortName):
Sequence @length: 323
ShortName Count: 5
EC Values Count: 7
Sequence @length: 503
ShortName Count: 0
EC Values Count: 5
Sequence @length: 323
ShortName Count: 3
EC Values Count: 4
Sequence @length: 490
ShortName Count: 0
EC Values Count: 4
Sequence @length: 490
ShortName Count: 0
EC Values Count: 4
Sequence @length: 323
ShortName Count: 3
EC Values Count: 3
Sequence @length: 323
ShortName Count: 3
EC Values Count: 3
Sequence @length: 539
ShortName Count: 2
EC Values Count: 3
Sequence @length: 494
ShortName Count: 1
EC Values Count: 3
Sequence @length: 277
ShortName Count: 0
EC Values Count: 3
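The `..` step in those paths climbs from the sequence to its parent entry before searching downward again. A minimal sketch of that navigation, run against a made-up two-entry document (hypothetical data, chosen so that one entry has no shortName):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical data: the second entry deliberately has no <shortName>.
my $dom = XML::LibXML->load_xml(string => <<'XML');
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>WITH_SHORT</name>
    <protein><recommendedName><shortName>WS</shortName></recommendedName></protein>
    <sequence length="5">MKVLA</sequence>
  </entry>
  <entry>
    <name>NO_SHORT</name>
    <protein><recommendedName><fullName>Example protein</fullName></recommendedName></protein>
    <sequence length="4">MKVL</sequence>
  </entry>
</uniprot>
XML

my $xpc = XML::LibXML::XPathContext->new($dom->documentElement);
$xpc->registerNs('u', 'http://uniprot.org/uniprot');

my @rows;
for my $seq ($xpc->findnodes('//u:sequence[@length>1]')) {
    # "../u:name" reads the name of the entry that owns this sequence.
    my $name = $xpc->findvalue('../u:name', $seq);
    # "..//u:shortName" searches the whole parent entry, not just the sequence.
    my @sn = $xpc->findnodes('..//u:shortName', $seq);
    push @rows, sprintf('%s length=%s shortNames=%d',
        $name, $seq->getAttribute('length'), scalar @sn);
}
print "$_\n" for @rows;
```

An empty result list is not an error, so entries without a shortName simply report a count of 0, as in the output above.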
For other tips on how to construct XPath expressions, see: XPath Examples