This is part of the XML output file I got from Stanford CoreNLP:
<collapsed-ccprocessed-dependencies>
<dep type="nn">
<governor idx="25">Mullen</governor>
<dependent idx="24">Ms.</dependent>
</dep>
<dep type="nsubj">
<governor idx="26">said</governor>
<dependent idx="25">Mullen</dependent>
</dep>
</collapsed-ccprocessed-dependencies>
</sentence>
</sentences>
<coreference>
<coreference>
<mention representative="true">
<sentence>1</sentence>
<start>1</start>
<end>2</end>
<head>1</head>
</mention>
<mention>
<sentence>1</sentence>
<start>33</start>
<end>34</end>
<head>33</head>
</mention>
</coreference>
<coreference>
<mention representative="true">
<sentence>1</sentence>
<start>6</start>
<end>9</end>
<head>8</head>
</mention>
<mention>
<sentence>1</sentence>
<start>10</start>
<end>11</end>
<head>10</head>
</mention>
</coreference>
</coreference>
How can I parse it with Perl to get something like this:
1. sentence 1, head 1
sentence 1, head 33
2. sentence 1, head 8
sentence 1, head 10
I tried using XML::Simple, but the output is not easy to understand. Here is what I did:
use XML::Simple;
use Data::Dumper;
$outfile = $filename.".xml";
$xml = new XML::Simple;
$data = $xml -> XMLin($outfile);
print Dumper($data);
Answer 0: (score: 4)
XML::Simple has the hardest interface to use. You could use something like this:
use XML::LibXML qw( );
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);
my $coref_count;
for my $coref_node ($doc->findnodes('//coreference/coreference')) {
++$coref_count;
my $mention_count;
for my $mention_node ($coref_node->findnodes('mention')) {
++$mention_count;
my $sentence = $mention_node->findvalue('sentence/text()');
my $head = $mention_node->findvalue('head/text()');
my $prefix = "$coref_count.";
$prefix = ' ' x length($prefix) if $mention_count > 1;
print "$prefix sentence $sentence, head $head\n";
}
}
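Answer 1: (score: 2)
It is a great shame that XML::Simple got first claim on the Simple namespace. It may be simple in implementation, but it is far from simple to use in anything but the most trivial cases. If you want something similar, XML::Smart offers a nested data structure API, but does it much better.
Thankfully, there is a wide choice of excellent Perl XML modules. XML::Twig is one of them. It lets you specify callback subroutines that are executed when particular elements are encountered in the XML data during parsing.
This program uses XML::Twig and sets a callback on coreference[mention], that is, a coreference element that has at least one mention child element.
The code in the handler subroutine does no checking. It assumes there are always at least two mention child elements, each of which has a sentence and a head element. The text values of these nodes are printed in the format you describe.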
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(twig_handlers => {
'coreference[mention]' => \&handle_coreference
});
$twig->parsefile('myxml.xml');
my $n;
sub handle_coreference {
my ($twig, $elt) = @_;
my @mentions = $elt->children('mention');
for my $i (0 .. $#mentions) {
printf "%s sentence %d, head %d\n",
$i == 0 ? sprintf('%3d.', ++$n) : ' ' x 4,
map $mentions[$i]->first_child_trimmed_text($_), qw/ sentence head /;
}
}
Output
1. sentence 1, head 1
sentence 1, head 33
2. sentence 1, head 8
sentence 1, head 10
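Since the handler above does no checking, here is a minimal sketch of a more defensive variant. It is an illustration under stated assumptions, not a replacement for the answer's code: the inline sample XML and its root element are hypothetical, the handler is registered on plain coreference rather than coreference[mention], and the rule of skipping any mention that lacks a sentence or head element is my own choice.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input: one complete group, one group missing <head>.
my $xml = <<'XML';
<root>
  <coreference>
    <coreference>
      <mention representative="true">
        <sentence>1</sentence><start>1</start><end>2</end><head>1</head>
      </mention>
      <mention>
        <sentence>1</sentence><start>33</start><end>34</end><head>33</head>
      </mention>
    </coreference>
    <coreference>
      <mention><sentence>1</sentence></mention>
    </coreference>
  </coreference>
</root>
XML

my $n = 0;    # running number of the groups actually printed
my @lines;    # collected output lines

sub handle_coreference {
    my ($twig, $elt) = @_;

    # Keep only mentions that have both a sentence and a head value.
    my @rows;
    for my $mention ($elt->children('mention')) {
        my $sentence = $mention->first_child_trimmed_text('sentence');
        my $head     = $mention->first_child_trimmed_text('head');
        next unless length $sentence and length $head;
        push @rows, [ $sentence, $head ];
    }

    # The outer <coreference> wrapper and incomplete groups end up here.
    return unless @rows;

    ++$n;
    for my $i (0 .. $#rows) {
        my $prefix = $i == 0 ? sprintf('%3d.', $n) : ' ' x 4;
        push @lines, sprintf '%s sentence %s, head %s', $prefix, @{ $rows[$i] };
    }
}

XML::Twig->new(
    twig_handlers => { coreference => \&handle_coreference },
)->parse($xml);

print "$_\n" for @lines;
```

With this sample input, the first group is printed with both of its mentions and the second group is skipped because its only mention has no head element. The values are formatted with %s rather than %d so that an unexpected non-numeric value does not trigger a warning.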
Answer 2: (score: 0)
Something like this:
use strict;
use warnings;
use XML::Rules;
my $mention_cnt;
my $ref_cnt = 1;
my @rules = (
coreference => sub {
$ref_cnt++ if $mention_cnt;
$mention_cnt = 0;
},
mention => sub {
my $d = $_[1];
my $str = $mention_cnt++ ? " " x 6 : sprintf("%-6s", "$ref_cnt.");
print "$str sentence: $d->{sentence} head: $d->{head}\n";
},
'sentence,head' => 'content',
);
my $xr = XML::Rules->new(
rules => \@rules,
);
$xr->parse($xml);