I have the following large XML file (5-10 GB):
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
</book>
<car id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
</car>
<book id="bk101">
<author>Joseph</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
</book>
<magazine id="bk103">
<author>Gambardella, Matthew</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
</magazine>
.....
</catalog>
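How can I use XML::Twig, or any other method in Perl, to read the book and magazine elements (ignoring car) and extract into a new file only those elements (the whole block) that contain the author name Gambardella, Matthew?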
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
</book>
<magazine id="bk103">
<author>Gambardella, Matthew</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
</magazine>
.....
</catalog>
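Answer 0 (score: 0)
This script expects the XML file as a command-line argument and removes every element that does not match $criteria. You should also consider splitting the input file into smaller chunks to avoid out-of-memory problems.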
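#!/usr/bin/env perl
use warnings FATAL => 'all';
use strict;
use XML::Twig;

die "Usage: $0 FILE.xml\n" unless @ARGV;

# Author to keep, and the element types the question asks about (<car> is ignored).
my $criteria = 'Gambardella, Matthew';
my %wanted   = map { $_ => 1 } qw(book magazine);

my $xml = XML::Twig->new(
    twig_handlers => {
        # Called for each direct child of <catalog> once it is fully parsed.
        'catalog/*' => \&catalog,
    },
    pretty_print => 'indented',
)->parsefile($ARGV[0]);

# Print what is left of the tree to STDOUT.
$xml->print;

sub catalog {
    my ($t, $elt) = @_;
    # Drop the element unless it is a wanted type with the matching author.
    $elt->cut()
        unless $wanted{ $elt->tag } && $elt->findvalue('author') eq $criteria;
    return;
}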
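The script above keeps every retained element in memory until the final print, which may still be a lot for a 5-10 GB input with many matches. As an alternative to splitting the file, here is a minimal streaming sketch that writes each match as soon as it is parsed and then frees it. It assumes the root element is always <catalog>, as in the sample, and that comparing only the first <author> child is enough; the helper name extract_by_author and the file names in the usage comment are hypothetical, not part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: copies every <book> or <magazine> whose first <author>
# child equals $author from $in_file to the open filehandle $out_fh.
sub extract_by_author {
    my ($in_file, $out_fh, $author) = @_;

    my $keep_if_match = sub {
        my ($twig, $elt) = @_;
        # Write the whole element immediately when the author matches.
        $elt->print($out_fh) if $elt->first_child_text('author') eq $author;
        # Release everything parsed so far so memory use stays flat.
        $twig->purge;
        return;
    };

    # Assumption: the wrapper element is <catalog>, as in the sample file.
    print {$out_fh} qq{<?xml version="1.0"?>\n<catalog>\n};

    XML::Twig->new(
        # Only these elements are built into a tree; <car> and any other
        # element types are skipped by the parser and never reach the output.
        twig_roots => {
            'catalog/book'     => $keep_if_match,
            'catalog/magazine' => $keep_if_match,
        },
        pretty_print => 'indented',
    )->parsefile($in_file);

    print {$out_fh} "</catalog>\n";
    return;
}

# Usage (placeholder file names): perl extract.pl input.xml output.xml
if (@ARGV == 2) {
    open my $out_fh, '>:encoding(UTF-8)', $ARGV[1]
        or die "Cannot write $ARGV[1]: $!";
    extract_by_author($ARGV[0], $out_fh, 'Gambardella, Matthew');
    close $out_fh or die "Cannot close $ARGV[1]: $!";
}
```

Using twig_roots rather than twig_handlers means elements outside the listed paths are never built, and calling purge after each element discards what has already been handled, so memory use should depend on the size of a single element rather than on the size of the file.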