Extracting attributes and values from an XML file in Perl

Time: 2013-04-08 21:10:48

Tags: perl xml-parsing stanford-nlp

This is part of the XML output file I get from Stanford CoreNLP:

<collapsed-ccprocessed-dependencies>  
      <dep type="nn">
        <governor idx="25">Mullen</governor>
        <dependent idx="24">Ms.</dependent>
      </dep>
      <dep type="nsubj">
        <governor idx="26">said</governor>
        <dependent idx="25">Mullen</dependent>
      </dep>
    </collapsed-ccprocessed-dependencies>
  </sentence>
</sentences>
<coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>1</start>
      <end>2</end>
      <head>1</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>33</start>
      <end>34</end>
      <head>33</head>
    </mention>
  </coreference>
 </coreference>
<mention representative="true">
      <sentence>1</sentence>
      <start>6</start>
      <end>9</end>
      <head>8</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>10</start>
      <end>11</end>
      <head>10</head>
    </mention>
  </coreference>
  <coreference>   

How can I parse this with Perl so that I get something like the following?

1. sentence 1, head 1
   sentence 1, head 33
2. sentence 1, head 8
   sentence 1, head 10

I tried XML::Simple, but its output is not easy to understand. This is what I did:

use XML::Simple;
use Data::Dumper;

$outfile = $filename.".xml";
$xml = new XML::Simple;

$data = $xml -> XMLin($outfile);
print Dumper($data);

3 answers:

Answer 0 (score: 4)

XML::Simple has the most difficult interface to use. You could use something like this:

use XML::LibXML qw( );

# $xml is expected to hold the XML document as a string
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);

my $coref_count;
for my $coref_node ($doc->findnodes('//coreference/coreference')) {
   ++$coref_count;

   my $mention_count;
   for my $mention_node ($coref_node->findnodes('mention')) {
      ++$mention_count;

      my $sentence = $mention_node->findvalue('sentence/text()');
      my $head     = $mention_node->findvalue('head/text()');

      my $prefix = "$coref_count.";
      $prefix = ' ' x length($prefix) if $mention_count > 1;

      print "$prefix sentence $sentence, head $head\n";
   }
}
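
The question title also mentions attributes. As a minimal sketch in the same XML::LibXML style, the dependency attributes (`type` on `dep`, `idx` on `governor` and `dependent`) can be read with `getAttribute`. The `<root>` wrapper and the `dep_lines` helper are assumptions made for illustration; the real CoreNLP file nests these elements more deeply.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, well-formed stand-in for the dependency part of the output.
my $sample = <<'XML';
<root>
  <collapsed-ccprocessed-dependencies>
    <dep type="nn">
      <governor idx="25">Mullen</governor>
      <dependent idx="24">Ms.</dependent>
    </dep>
    <dep type="nsubj">
      <governor idx="26">said</governor>
      <dependent idx="25">Mullen</dependent>
    </dep>
  </collapsed-ccprocessed-dependencies>
</root>
XML

# Illustrative helper (not part of any library): returns one line per <dep>,
# formatted as type(governor-idx, dependent-idx).
sub dep_lines {
    my ($doc) = @_;
    my @lines;
    for my $dep ($doc->findnodes('//collapsed-ccprocessed-dependencies/dep')) {
        my $type  = $dep->getAttribute('type');
        my ($gov) = $dep->findnodes('governor');
        my ($dpt) = $dep->findnodes('dependent');
        # Skip incomplete entries instead of dying on them.
        next unless defined $type && defined $gov && defined $dpt;
        push @lines, sprintf '%s(%s-%s, %s-%s)', $type,
            $gov->textContent, $gov->getAttribute('idx'),
            $dpt->textContent, $dpt->getAttribute('idx');
    }
    return @lines;
}

my $doc = XML::LibXML->load_xml(string => $sample);
print "$_\n" for dep_lines($doc);
```

For a file on disk, `XML::LibXML->load_xml(location => $path)` can be used instead of the `string` form.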

Answer 1 (score: 2)

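It is a shame that XML::Simple got first claim on the Simple namespace. It may be simple in its implementation, but it is anything but simple to use, except in the most trivial cases. If you want something similar, XML::Smart also provides a nested data structure API, but does it much better.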
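
To illustrate the point, here is a rough sketch of what it takes to get predictable output from XML::Simple for this data. Without `ForceArray`, an element that occurs once becomes a hash while one that occurs several times becomes an array, so the shape of the result changes with the input. The `<root>` wrapper, the sample text, and the `coref_lines_simple` helper are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);   # strict mode insists on ForceArray and KeyAttr

# Hypothetical, well-formed stand-in for the coreference part of the output.
my $sample = <<'XML';
<root>
  <coreference>
    <coreference>
      <mention representative="true">
        <sentence>1</sentence><start>1</start><end>2</end><head>1</head>
      </mention>
      <mention>
        <sentence>1</sentence><start>33</start><end>34</end><head>33</head>
      </mention>
    </coreference>
    <coreference>
      <mention representative="true">
        <sentence>1</sentence><start>6</start><end>9</end><head>8</head>
      </mention>
      <mention>
        <sentence>1</sentence><start>10</start><end>11</end><head>10</head>
      </mention>
    </coreference>
  </coreference>
</root>
XML

# Illustrative helper (not part of XML::Simple): returns the numbered lines.
sub coref_lines_simple {
    my ($xml_text) = @_;
    my $data = XMLin(
        $xml_text,
        ForceArray => [qw(coreference mention)],  # always arrays, even for one item
        KeyAttr    => [],                         # no folding of arrays into hashes
    );

    my @lines;
    my $n = 0;
    for my $outer (@{ $data->{coreference} || [] }) {
        for my $chain (@{ $outer->{coreference} || [] }) {
            $n++;
            my $first = 1;
            for my $mention (@{ $chain->{mention} || [] }) {
                # Number only the first mention of each chain.
                my $prefix = $first ? "$n." : ' ' x length("$n.");
                $first = 0;
                push @lines,
                    "$prefix sentence $mention->{sentence}, head $mention->{head}";
            }
        }
    }
    return @lines;
}

print "$_\n" for coref_lines_simple($sample);
```

Note that attributes and child elements end up in the same hash (`representative` sits next to `sentence` and `head`), which is another reason the dumped structure is hard to read.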

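Thankfully, there are plenty of excellent Perl XML modules to choose from. XML::Twig is one of them; it lets you specify callback subroutines that are executed whenever particular elements are encountered in the XML data during parsing.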

This program uses XML::Twig and sets a callback on coreference[mention], that is, a coreference element that has at least one mention child element.

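The code in the handler subroutine does no checking and assumes that there are always at least two mention child elements, each of which has a sentence and a head element. The text values of these nodes are printed in the format you described.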

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(twig_handlers => {
  'coreference[mention]' => \&handle_coreference
});
$twig->parsefile('myxml.xml');

my $n;
sub handle_coreference {

  my ($twig, $elt) = @_;

  my @mentions = $elt->children('mention');

  for my $i (0 .. $#mentions) {
    printf "%s sentence %d, head %d\n",
      $i == 0 ? sprintf '%3d.', ++$n : '    ',
      map $mentions[$i]->first_child_trimmed_text($_), qw/ sentence head /;
  }
}

Output

  1. sentence 1, head 1
     sentence 1, head 33
  2. sentence 1, head 8
     sentence 1, head 10
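
Since the handler above does no checking, a more defensive variant might look like the sketch below. It registers the handler on plain `coreference` and skips any mention that lacks a `sentence` or `head` child, which also makes the outer wrapper element a no-op. The sample document, the `@lines` accumulator, and the incomplete third mention are assumptions added for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample; the third mention is deliberately incomplete.
my $sample = <<'XML';
<root>
  <coreference>
    <coreference>
      <mention representative="true">
        <sentence>1</sentence><head>1</head>
      </mention>
      <mention>
        <sentence>1</sentence><head>33</head>
      </mention>
      <mention><sentence>2</sentence></mention>
    </coreference>
  </coreference>
</root>
XML

my @lines;    # collected output lines
my $n = 0;    # number of chains emitted so far

sub handle_coreference {
    my ($twig, $elt) = @_;

    # Keep only mentions that have both a <sentence> and a <head> child.
    my @rows;
    for my $mention ($elt->children('mention')) {
        my $sentence = $mention->first_child('sentence');
        my $head     = $mention->first_child('head');
        next unless defined($sentence) && defined($head);
        push @rows, [ $sentence->trimmed_text, $head->trimmed_text ];
    }
    return 1 unless @rows;    # e.g. the outer wrapper <coreference>

    # Number the first row of the chain, indent the rest to match.
    my $label = sprintf '%3d.', ++$n;
    for my $row (@rows) {
        push @lines, sprintf '%s sentence %s, head %s', $label, @$row;
        $label = ' ' x length $label;
    }
    return 1;
}

XML::Twig->new(twig_handlers => { coreference => \&handle_coreference })
         ->parse($sample);

print "$_\n" for @lines;
```

Collecting the lines in an array rather than printing inside the handler keeps the formatting logic easy to check separately from the parsing.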

Answer 2 (score: 0)

Something like this:

use strict;
use warnings;

use XML::Rules;

my $mention_cnt;
my $ref_cnt = 1;
my @rules = (
  coreference => sub {
    $ref_cnt++ if $mention_cnt;
    $mention_cnt = 0;
  },
  mention => sub {
    my $d = $_[1];
    my $str = $mention_cnt++ ? " " x 6 : sprintf("%-6s", "$ref_cnt.");
    print "$str sentence: $d->{sentence} head: $d->{head}\n";
  },
  'sentence,head' => 'content',
);

my $xr = XML::Rules->new(
  rules => \@rules,
);
$xr->parse($xml);