Using a Unix script to split a large XML file into smaller files by chunks of child nodes

Time: 2016-10-07 15:02:34

Tags: xml bash perl shell unix

I could easily do this in Java or C#, but doing it in a shell script would take a lot of learning, so any help is appreciated.

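I have a huge XML file whose root node contains url child nodes (say 100K of them). I need to split input.xml into files of 10K nodes each, so that I end up with 10 files, each holding 10K nodes wrapped in the parent tag (the urlset tag).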

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
<url>
  <loc> https://www.mywebsite.com/shopping </loc>
  <changefreq> Weekly </changefreq>
  <priority> 0.8 </priority>
  <lastmod> 2016-09-22 </lastmod>
</url>
</urlset>

1 answer:

Answer 0 (score: 2)

The short answer is yes, this is perfectly doable.

XML::Twig supports "cut" and "paste" operations, as well as incremental parsing (for a lower memory footprint).

So you would do something like:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

#new document. Manually set xmlns - could copy this from 'original'
#instead though. 
my $new_doc = XML::Twig->new;
$new_doc->set_root(
   XML::Twig::Elt->new(
      'urlset', { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
   )
);
$new_doc->set_pretty_print('indented_a');

my $elt_count    = 0;
#small value for demonstration; set to 10_000 for 10K nodes per file.
my $elts_per_doc = 2;
my $count_of_xml = 0;

#handle each 'url' element. 
sub handle_url {
   my ( $twig, $elt ) = @_;
   #once the current doc holds $elts_per_doc elements, write it out,
   #then start a new one.
   if ( $elt_count >= $elts_per_doc ) {
      $elt_count = 0;
      my $filename = "new_xml_" . $count_of_xml++ . ".xml";
      open( my $output, '>', $filename )
        or die "Cannot open $filename: $!";
      print {$output} $new_doc->sprint;
      close($output) or die "Cannot close $filename: $!";
      $new_doc = XML::Twig->new();
      $new_doc->set_root(
         XML::Twig::Elt->new(
            'urlset',
            { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
         )
      );
      $new_doc->set_pretty_print('indented_a');
   }

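   #cut this element, paste it at the end of the new doc so the
   #original order is kept (paste defaults to 'first_child', which
   #would reverse the order).
   #note - this doesn't alter the original on disk - only the 'in memory'
   #copy.
   $elt->cut;
   $elt->paste( last_child => $new_doc->root );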
   $elt_count++;
   #purge clears any _closed_ tags from memory, so it preserves 
   #structure.
   $twig->purge;
}

#set a handler, start the parse.

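my $twig = XML::Twig->new( twig_handlers => { 'url' => \&handle_url } );
$twig->parsefile('your_file.xml');

#write out whatever is left in the last (possibly partially filled) doc.
if ( $elt_count > 0 ) {
   my $filename = "new_xml_" . $count_of_xml++ . ".xml";
   open( my $output, '>', $filename ) or die "Cannot open $filename: $!";
   print {$output} $new_doc->sprint;
   close($output) or die "Cannot close $filename: $!";
}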
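
If installing XML::Twig is not an option, the same split can be roughly approximated with plain awk. This is only a sketch and not an XML parser: it assumes every <url> and </url> tag sits on its own line, as in the sample input, and that the opening <urlset ...> tag fits on one line. The function name split_sitemap and the part_N.xml output names are made up for illustration.

```shell
#!/bin/sh
# Line-oriented sketch (assumption: one tag per line, as in the sample).
# $1 = input file, $2 = number of <url> elements per output file.
split_sitemap() {
    awk -v per_file="$2" '
        /<urlset/    { header = $0; next }   # remember the opening parent tag
        /<\/urlset>/ { next }                # closing tag is re-added per file
        /<url>/ {
            if (count % per_file == 0) {     # current file is full: start a new one
                if (out != "") { print "</urlset>" > out; close(out) }
                out = sprintf("part_%d.xml", file_no++)
                print header > out
            }
            count++
        }
        out != "" { print > out }            # copy lines once a file is open
        END { if (out != "") { print "</urlset>" > out; close(out) } }
    ' "$1"
}
```

For the case in the question this would be called as `split_sitemap input.xml 10000`. Input that puts several tags on one line, or uses comments or CDATA sections, needs a real parser such as the XML::Twig script above.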