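Splitting an XML file by tag on UNIX

Time: 2015-07-08 15:08:19

Tags: xml unix xml-parsing

My XML file contains a batch of items like the following.

I want to split this file into 5 files, one per Item tag, using a shell script. Thanks in advance for any help.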

<Items>
<Item>
<Title>Title 1</Title>
<DueDate>01-02-2008</DueDate>
</Item>
<Item>
<Title>Title 2</Title>
<DueDate>01-02-2009</DueDate>
</Item>
<Item>
<Title>Title 3</Title>
<DueDate>01-02-2010</DueDate>
</Item>
<Item>
<Title>Title 4</Title>
<DueDate>01-02-2011</DueDate>
</Item>
<Item>
<Title>Title 5</Title>
<DueDate>01-02-2012</DueDate>
</Item>
</Items>

Expected output (one file per Item, for example):

<Items>
<Item>
<Title>Title 1</Title>
<DueDate>01-02-2008</DueDate>
</Item>
</Items>

1 answer:

Answer 0 (score: 1)

I would suggest installing XML::Twig, which includes the rather handy xml_split utility. That may do what you need, e.g.:

xml_split -c Item

That said, what you're trying to accomplish isn't trivial, because you need to cut the document up while keeping each piece well-formed XML. Standard line- or regex-based tools can't do that reliably.

You can, however, use a parser:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

my @item_list;

sub cut_item {
    my ( $twig, $item ) = @_;
    my $thing = $item->cut;
    push( @item_list, $thing );

}

my $twig = XML::Twig->new(
    twig_handlers => { 'Item' => \&cut_item }
);
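# Slurp all input (STDIN or files named on the command line) into one string.
# A bare <> in list context would pass each line as a separate argument.
my $xml = do { local $/; <> };
$twig->parse($xml);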

my $itemcount = 1;

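foreach my $element (@item_list) {
    # Build a fresh document with an empty <Items> root for each item.
    my $newdoc = XML::Twig->new( 'pretty_print' => 'indented_a' );
    $newdoc->set_root( XML::Twig::Elt->new('Items') );

    $element->paste( $newdoc->root );

    # Echo the new document to STDOUT, then write it to items_N.xml.
    $newdoc->print;
    my $filename = "items_" . $itemcount++ . ".xml";
    open( my $output, ">", $filename ) or die "Cannot open $filename: $!";
    print {$output} $newdoc->sprint;
    close($output) or die "Cannot close $filename: $!";
}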

This uses the XML::Twig library to extract each of the Item elements from your XML (piped on STDIN, or passed as a filename: myscript.pl yourfilename).

It then iterates over the items it found, wraps each one in a new Items root, and writes it to a separate file (items_1.xml, items_2.xml, and so on). This approach might take a little more fiddling if you had a more complex root, but it is adaptable if you do.
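
As a rough illustration of that last point, here is a minimal sketch of one way to carry a more complex root over to each output document. It assumes "more complex" only means the root has attributes. The split_items helper and the batch attribute in the sample are hypothetical names made up for this example, and any non-Item children of the root are not copied.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns one XML string per <Item>, each wrapped in a
# copy of the original root element (same tag name, same attributes).
sub split_items {
    my ($xml) = @_;

    my @items;
    my $twig = XML::Twig->new(
        twig_handlers => {
            'Item' => sub {
                my ( $t, $item ) = @_;
                $item->cut;              # detach the Item from the tree
                push @items, $item;      # keep it for later
            },
        },
    );
    $twig->parse($xml);

    # Reuse the root's tag and attributes instead of hard-coding 'Items'.
    my $root_tag  = $twig->root->tag;
    my %root_atts = %{ $twig->root->atts || {} };

    my @docs;
    foreach my $item (@items) {
        my $newdoc = XML::Twig->new( pretty_print => 'indented_a' );
        # Pass a fresh copy of the attribute hash to each new root.
        $newdoc->set_root( XML::Twig::Elt->new( $root_tag, {%root_atts} ) );
        $item->paste( $newdoc->root );
        push @docs, $newdoc->sprint;
    }
    return @docs;
}

# Sample input; the batch attribute is invented for this example.
my $sample = '<Items batch="42">'
    . '<Item><Title>Title 1</Title></Item>'
    . '<Item><Title>Title 2</Title></Item>'
    . '</Items>';

my @docs = split_items($sample);
print scalar(@docs), " documents\n";
print $_, "\n" for @docs;
```

To write files instead of printing, loop over the returned strings and open items_N.xml for each one, as in the script above.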