Question

I have the following large XML file (5-10 GB):

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>          
   </book>
   <car id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
   </car>
   <book id="bk101">
      <author>Joseph</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
   </book>
   <magazine id="bk103">
      <author>Gambardella, Matthew</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
   </magazine>
   .....
</catalog>

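How can I use XML::Twig, or any other method in Perl, to read the book and magazine elements (ignoring car) and extract only those elements (the whole block) whose author is Gambardella, Matthew into a new file?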

   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>          
   </book>      
   <magazine id="bk103">
      <author>Gambardella, Matthew</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
   </magazine>
   .....
</catalog>

Answer 1

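This script expects the XML file as a command-line argument and removes every element that does not match $criteria. You should also consider splitting the input file into smaller chunks to avoid out-of-memory problems.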

#!/usr/bin/env perl

use warnings FATAL => 'all';
use strict;
use XML::Twig;

my $criteria = 'Gambardella, Matthew';
my $xml  = XML::Twig->new(
  twig_handlers => {
    'catalog/*' => \&catalog,
  },
  pretty_print => 'indented',
)->parsefile($ARGV[0]);

print $xml->toString();

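# Called for every direct child of <catalog> once it has been fully parsed.
sub catalog {
  my ($t, $elt) = @_;

  # Keep only <book> and <magazine> elements whose author matches;
  # cut everything else, including every <car>.
  $elt->cut()
    unless $elt->tag =~ /^(?:book|magazine)$/
       and $elt->findvalue('author') eq $criteria;

  return;
}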
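
For a 5-10 GB input, a streaming variant may be easier on memory than keeping the matches in the tree until the end. The following is only a minimal sketch, not a tested solution for your data: it assumes the elements sit directly under <catalog>, that each one has a single <author>, and that printing to STDOUT is acceptable. The helper name build_filter is made up for illustration. It uses twig_roots so that only <book> and <magazine> subtrees are built, prints each match as soon as it is complete, and calls purge to release what has already been parsed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of the script above): returns a twig that
# writes every <book>/<magazine> whose first <author> equals $author to $out.
sub build_filter {
    my ($out, $author) = @_;

    my $keep_if_match = sub {
        my ($twig, $elt) = @_;

        # Emit the whole element as soon as it has been parsed.
        print {$out} $elt->sprint, "\n"
            if $elt->first_child_text('author') eq $author;

        # Discard everything parsed so far to keep memory use flat.
        $twig->purge;
    };

    return XML::Twig->new(
        # Only these elements are built into subtrees; <car> is skipped.
        twig_roots => {
            'catalog/book'     => $keep_if_match,
            'catalog/magazine' => $keep_if_match,
        },
        pretty_print => 'indented',
    );
}

# Usage (assumed file names): perl extract.pl big.xml > matches.xml
if (@ARGV) {
    print "<catalog>\n";
    build_filter(\*STDOUT, 'Gambardella, Matthew')->parsefile($ARGV[0]);
    print "</catalog>\n";
}
```

Because each element is written out and purged immediately, memory use should depend on the size of a single <book> or <magazine> and not on the size of the file or the number of matches.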
