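I could easily do this in Java or C#, but doing it in a shell script would take a lot of learning, so any help is appreciated.
I have a huge XML file, input.xml, whose root node has a large number of url child nodes (say 100K). I need to split it into files of 10K nodes each, so that I end up with 10 files, each containing 10K url nodes wrapped in the parent tag (the urlset tag).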
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
</urlset>
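Answer 0 (score: 2)
Short answer: yes, this is entirely doable.
XML::Twig supports 'cut' and 'paste' operations, as well as incremental parsing (for a lower memory footprint).
So you would do something like this: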
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Number of 'url' elements per output file. 2 is handy for testing against
# the sample above; set it to 10_000 for the real input.
my $elts_per_doc = 2;
my $elt_count    = 0;
my $count_of_xml = 0;

# Build an empty output document with a 'urlset' root.
# The xmlns is set manually here - it could be copied from the original
# instead.
sub new_output_doc {
    my $doc = XML::Twig->new;
    $doc->set_root(
        XML::Twig::Elt->new(
            'urlset', { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
        )
    );
    $doc->set_pretty_print('indented_a');
    return $doc;
}

# Write a document to the next numbered output file.
sub write_output_doc {
    my ($doc) = @_;
    my $filename = "new_xml_" . $count_of_xml++ . ".xml";
    open( my $output, '>', $filename ) or die "Cannot open $filename: $!";
    print {$output} $doc->sprint;
    close($output) or die "Cannot close $filename: $!";
}

my $new_doc = new_output_doc();

# Handle each 'url' element.
sub handle_url {
    my ( $twig, $elt ) = @_;
    # Once the current doc is full, write it out and start a new one.
    if ( $elt_count >= $elts_per_doc ) {
        write_output_doc($new_doc);
        $new_doc   = new_output_doc();
        $elt_count = 0;
    }
    # Cut this element and paste it as the last child of the new root, so
    # the original order is kept (paste defaults to first_child).
    # Note - this doesn't alter the original on disk, only the in-memory copy.
    $elt->cut;
    $elt->paste( last_child => $new_doc->root );
    $elt_count++;
    # purge clears any _closed_ tags from memory, so it preserves structure.
    $twig->purge;
}

# Set a handler, start the parse.
XML::Twig->new( twig_handlers => { 'url' => \&handle_url } )
    ->parsefile('your_file.xml');

# Write out the last document, which the handler never gets to flush.
write_output_doc($new_doc) if $elt_count > 0;
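The script above sets xmlns by hand and notes that it could be copied from the original instead. A minimal sketch of that idea follows, assuming the namespace is declared as a plain xmlns attribute on the root urlset element. The helper name root_xmlns and the inline sample string are made up for illustration; they are not part of XML::Twig or of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): return the xmlns attribute
# of the root 'urlset' element of an XML string.
sub root_xmlns {
    my ($xml) = @_;
    my $xmlns;
    XML::Twig->new(
        # start_tag_handlers fire when the opening tag has been parsed,
        # before any child elements have been read.
        start_tag_handlers => {
            urlset => sub {
                my ( $twig, $elt ) = @_;
                $xmlns = $elt->att('xmlns');
            },
        },
    )->parse($xml);
    return $xmlns;
}

# Tiny stand-in for the real input file.
my $sample = <<'XML';
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.mywebsite.com/shopping</loc></url>
</urlset>
XML

my $xmlns = root_xmlns($sample);

# Build the new root with the copied namespace instead of a hard-coded one.
my $root = XML::Twig::Elt->new( 'urlset', { xmlns => $xmlns } );
print $root->sprint, "\n";
```

In the splitting script, the same start_tag_handlers entry could probably be passed to the XML::Twig->new call that already sets twig_handlers, so the namespace is captured during the single pass over your_file.xml. The first output document would then have to be created after the root's start tag has been seen, not before the parse begins.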