Question

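I have been using Perl for two weeks and am trying to parse a 300 MB nested XML file, so please forgive my lack of knowledge. The file follows a format similar to this: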

<?xml version="1.0" encoding="UTF-8"?>
   <APP:Report xsi:schemaLocation="WWW" xmlns:xsi="WWW" xmlns:APP="WWW">
   <library>
    <elt>
     <Book>The book of pages</Book>
     <Snap></Snap>
     <Line1>The Beginning</Line1>
     <Line2>We ceased to exist</Line2>
     <Line3>Accept it</Line3>
     <Line4>Now we live</Line4>
     <Line5>We reject it</Line5>
     <Rating>
      <C1>6.1</C1>
      <C2>8.9</C2>
      <C3>9.4</C3>
     </Rating>
    </elt>
    <Author>Sally</Author>
    <Publisher>Penguin</Publisher>
    <elt>
     <Book>The song</Book>
     <Snap></Snap>
     <Line1>This is how we do it</Line1>
     <Line2>I hope this works</Line2>
     <Line3>Please do</Line3>
     <Line4>Begging you</Line4>
     <Line5>Bye</Line5>
     <Rating>
      <C1>2.3</C1>
      <C2>9.9</C2>
      <C3>4.5</C3>
     </Rating>
    </elt>
    <Author>Justin</Author>
    <Publisher>Victoria</Publisher>
   </library>
  </APP:Report>

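I would like to display Book, Snap, Line1, Line2, Line3, Line4, Line5, C1, C2 and C3 in separate columns on the first row, the Author on the second row and the Publisher on the third row. This is just a sample of the large file I have. I don't want to name the specific children to display; I want to be able to display all of an element's descendants.

Currently it prints all of my data in row 1, column 1. A snippet of my code is shown below. What is the best way to do this? I would appreciate any suggestions. Thanks!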

    my $twig= new XML::Twig();
    $twig->parsefile( $_);    # build the twig
    foreach my $elt ($twig->root->children)
    {
        print $fout1 $elt->text."\n";
    }

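Edit to the question: what if I have children nested inside nested children? What is the most efficient way to handle that? For example, how do I access the elt elements of each C? My second question is how to display those elements like this: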

  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C1.X|
  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C1.Y|
  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C2.X|
  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C2.Y|
  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C3.X|
  The book of pages|Snap|Line1|Line2|Line3|Line4|Line5|C3.Y|
  .
  .
  .
  .
  .
  The song|Snap|Line1|Line2|Line3|Line4|Line5|C2.X|
  The song|Snap|Line1|Line2|Line3|Line4|Line5|C2.Y|
  Example 
    <Rating>
      <C1>
        <elt>
         <X></X>
         <X></X>
         </elt>
        <elt>
        <elt>
      </C1>
      <C2>
        <elt>
        <elt>
        <elt>
      </C2>
      <C3>
        <elt>
        <elt>
        <elt>
      </C3>
     </Rating>

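As ikegami suggested, the easiest way to do this is to create a handler for Rating. The problem is the time it takes to parse. The file I am parsing is 300 MB and has about 20 such routines like Rating, so I parse the large file once and then parse parts of it 20 more times. Is there another way to do this? Is there another XML module that would be more useful than XML::Twig?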

Answer 1

So you want the nodes matching the XPath

descendant::*[count(*)=0]

a.k.a.

.//*[count(*)=0]

relative to the elt element. I use XML::LibXML, so I would do

$elt_node->findnodes("descendant::*[count(*)=0]")
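For context, here is a minimal, self-contained sketch of how that call might be used with XML::LibXML. The inline sample document, the `/*/library/elt` path and the `name=value` output format are assumptions made for illustration, not part of the original question. Note that `load_xml` builds the whole document tree in memory, which may or may not be acceptable for a 300 MB file.

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Small inline stand-in for the real file (assumption for this sketch).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<APP:Report xmlns:APP="WWW">
 <library>
  <elt>
   <Book>The book of pages</Book>
   <Snap></Snap>
   <Rating>
    <C1>6.1</C1>
    <C2>8.9</C2>
   </Rating>
  </elt>
 </library>
</APP:Report>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

my @lines;
# '/*' matches the root element whatever its (prefixed) name is.
for my $elt_node ( $doc->findnodes('/*/library/elt') ) {
   # Leaf elements only: descendants that have no element children.
   my @leaves = $elt_node->findnodes('descendant::*[count(*)=0]');
   push @lines, join '|', map { $_->nodeName() . '=' . $_->textContent() } @leaves;
}

print "$_\n" for @lines;
```

Rating is skipped because it has element children, while the empty Snap element is kept with an empty value.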

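A similar solution should be possible with XML::Twig. (It does have findnodes.)

Ugh, I forgot how bad XML::Twig's XPath support is. It doesn't know count, and * matches non-elements. No problem, we just have to do the work ourselves.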

use strict;
use warnings;
use feature qw( say );

use XML::Twig qw( );

my @eles = qw( Book Snap Line1 Line2 Line3 Line4 Line5 C1 C2 C3 );

my $twig = XML::Twig->new(
   twig_handlers => {
      '/APP:Report/library/elt' => sub {
         my ($twig, $ele) = @_;

         my %row =
            map { $_->name() => $_->text() // '' }
               # Manual equivalent of $ele->findnodes("descendant::*[count(*)=0]")
               grep { $_->name() ne '#PCDATA' && ( grep { $_->name() ne '#PCDATA' } $_->children ) == 0 }
                  $ele->descendants();

         say join '|', @row{@eles};

         $twig->purge();  # Free unneeded memory.
      },
   },
);

say join '|', @eles;    
$twig->parsefile('my_big.xml');

Output:

Book|Snap|Line1|Line2|Line3|Line4|Line5|C1|C2|C3
The book of pages||The Beginning|We ceased to exist|Accept it|Now we live|We reject it|6.1|8.9|9.4
The song||This is how we do it|I hope this works|Please do|Begging you|Bye|2.3|9.9|4.5
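On the concern about parsing the file many times: as far as I know, `twig_handlers` accepts any number of handlers, so several element types can be dealt with in a single pass. The following is only a sketch under assumptions: the inline sample, the partial paths `library/elt`, `library/Author` and `library/Publisher`, the helper names `is_leaf` and `leaf_text`, and collecting rows in `@rows` are all illustrative choices rather than something taken from the question.

```perl
use strict;
use warnings;

use XML::Twig qw( );

# Small inline stand-in for the real file (assumption for this sketch).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<APP:Report xmlns:APP="WWW">
 <library>
  <elt>
   <Book>The book of pages</Book>
   <Snap></Snap>
   <Line1>The Beginning</Line1>
   <Rating>
    <C1>6.1</C1>
    <C2>8.9</C2>
   </Rating>
  </elt>
  <Author>Sally</Author>
  <Publisher>Penguin</Publisher>
 </library>
</APP:Report>
XML

my @rows;   # Kept in memory only because this demo is tiny.

# Hypothetical helper: true if the element has no element children.
sub is_leaf {
   my ($ele) = @_;
   my $element_children = grep { $_->name() ne '#PCDATA' } $ele->children();
   return $element_children == 0;
}

# Hypothetical helper: element text, with undef mapped to ''.
sub leaf_text {
   my ($ele) = @_;
   my $text = $ele->text();
   return defined $text ? $text : '';
}

my $twig = XML::Twig->new(
   twig_handlers => {
      # One row per top-level elt: all of its leaf descendants.
      'library/elt' => sub {
         my ($t, $ele) = @_;
         my @leaves = grep { $_->name() ne '#PCDATA' && is_leaf($_) } $ele->descendants();
         push @rows, join '|', map { leaf_text($_) } @leaves;
         $t->purge();   # Free what has been parsed so far.
      },
      # Author and Publisher each go on their own row, in the same pass.
      'library/Author'    => sub { push @rows, leaf_text( $_[1] ) },
      'library/Publisher' => sub { push @rows, leaf_text( $_[1] ) },
   },
);

$twig->parse($xml);

print "$_\n" for @rows;
```

Because `library/elt` only matches elt elements whose parent is library, elt elements nested deeper (for example under C1) would not trigger that handler on their own. For the real file you would call `parsefile` instead of `parse` and print each row as it is produced rather than collecting them.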
