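How do I split an XML file using XPath in Perl?

Time: 2014-11-19 14:50:10

Tags: xml perl xpath

I have an input XML file that I need to split by doc and by delt, saving each piece under a name of the form delt_0001.xml.

Here is my code: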

    #!/usr/bin/perl
    use XML::XPath;

    my $file = 'file.xml';

    my $xp = XML::XPath->new(filename=>$file);

    foreach my $entry ( $xp->findnodes('/xml/service/main/doc') ) {
        my $filename = $entry->findvalue('./delt/@id');
        foreach my $entry1 ( $entry->findnodes('//delt') ) {
            my $filename = $entry1->findvalue('/delt/@id');
            my $content  = $entry1->toString;
            open(wr, ">delt_$filename.xml");
            print wr "$content\n";
            close wr;
        }
    }

When I run the program, all of the delt sections are printed into a single XML file.

Input XML (delt.xml):

<xml>
<service>
<title>split xml</title>
<main>
<doc id="001">
<title>doc1</title>
<delt id="0001">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
<delt id="0002-A">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
<doc id="002">
<title>doc2</title>
<delt id="0003">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
<delt id="0004">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

Current output:

        <delt id="0001">
        <title>delt1</title>
        <text>num1</text>
        <text>num1</text>
        </delt>
        <delt id="0002-A">
        <title>delt1</title>
        <text>num1</text>
        <text>num1</text>
        </delt>
        <delt id="0003">
        <title>delt1</title>
        <text>num1</text>
        <text>num1</text>
        </delt>
        <delt id="0004">
        <title>delt1</title>
        <text>num1</text>
        <text>num1</text>
        </delt>

Required output:

Split 1: delt_0001.xml

<xml>
<service>
<title>split xml</title>
<main>
<doc id="001">
<title>doc1</title>
<delt id="0001">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

Split 2: delt_0002-A.xml

<xml>
<service>
<title>split xml</title>
<main>
<doc id="001">
<title>doc1</title>
<delt id="0002-A">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

Split 3: delt_0003.xml

<xml>
<service>
<title>split xml</title>
<main>
<doc id="002">
<title>doc2</title>
<delt id="0003">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

Split 4: delt_0004.xml

<xml>
<service>
<title>split xml</title>
<main>
<doc id="002">
<title>doc2</title>
<delt id="0004">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

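Thanks in advance.

2 answers:

Answer 0: (score: 1)

This is pretty simple to do with XML::Twig (and I'm glad I got "delete the current element during parsing" working a while back):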

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $delt= 'delt.xml';

XML::Twig->new( twig_handlers => { delt => \&delt },
                pretty_print => 'indented',
              )
          ->parsefile( $delt);

exit;

sub delt
  { my( $t, $delt)= @_;

    my $delt_file= sprintf( 'delt_%s.xml', $delt->id);

    # the only tricky part: remove previous doc if needed
    if( my $prev_doc= $delt->parent( 'doc')->prev_sibling( 'doc')) 
      { $prev_doc->delete; }

    $t->print_to_file( $delt_file);

    $delt->delete;
  }

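Answer 1: (score: 0)

The reason you're having difficulty is that you are extracting a subset of the XML document and then also trying to include some material from its "parent" elements.

Pulling out just your 'delts' on their own would be quite simple.

I would use XML::Twig for this; it is a perfect place to use twig handlers.

I'm thinking of something along these lines (apologies, this doesn't fully work yet: it prints each delt but does not write the files).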

use strict;
use warnings;
use XML::Twig;

sub process_delt {
    my ( $twig, $delt ) = @_;
    my $delt_id = $delt->att('id');
    print "\nID:\n$delt_id\n";
    my $filename = "$delt_id.xml";


    $delt->set_pretty_print('indented');
    $delt->print;

    print "\n--------\n";

}

my $twig = XML::Twig->new(
    twig_handlers => { delt => \&process_delt },
);
local $/;
$twig->parse(<DATA>);


__DATA__
<xml>
<service>
<title>split xml</title>
<main>
<doc id="001">
<title>doc1</title>
<delt id="0001">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
<delt id="0002-A">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
<doc id="002">
<title>doc2</title>
<delt id="0003">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
<delt id="0004">
<title>delt1</title>
<text>num1</text>
<text>num1</text>
</delt>
</doc>
</main>
</service>
</xml>

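Edit: take a look at @mirod's answer, as it works perfectly. This one only extracts each 'delt', and you would probably still have to mess around working out the parent parts.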
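
For what it's worth, one way to keep the parent parts without streaming is to re-parse the document once per delt and delete everything that does not belong in that output. The following is only a rough sketch under a few assumptions: the input is small enough to parse repeatedly, every delt has a unique id, and `split_by_delt` is a hypothetical helper name (it is not part of XML::Twig or of either answer). The sample XML is a shortened version of the question's input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns a hash mapping each delt id to the text of
# the whole document with every other delt (and every emptied doc) removed.
sub split_by_delt {
    my ($xml) = @_;

    # First pass: collect the ids of all delt elements.
    my $first = XML::Twig->new->parse($xml);
    my @ids   = map { $_->att('id') } $first->root->descendants('delt');

    my %split;
    for my $id (@ids) {
        # Re-parse so that each output starts from an untouched copy.
        my $twig = XML::Twig->new( pretty_print => 'indented' )->parse($xml);

        # Drop every delt except the one being written out ...
        for my $delt ( $twig->root->descendants('delt') ) {
            $delt->delete unless $delt->att('id') eq $id;
        }

        # ... and every doc that is left without a delt child.
        for my $doc ( $twig->root->descendants('doc') ) {
            $doc->delete unless $doc->first_child('delt');
        }

        $split{$id} = $twig->sprint;
    }
    return %split;
}

# Shortened sample input (assumption: same structure as the question's file).
my $xml = <<'XML';
<xml><service><title>split xml</title><main>
<doc id="001"><title>doc1</title>
<delt id="0001"><title>delt1</title><text>num1</text></delt>
<delt id="0002-A"><title>delt1</title><text>num1</text></delt>
</doc>
<doc id="002"><title>doc2</title>
<delt id="0003"><title>delt1</title><text>num1</text></delt>
</doc>
</main></service></xml>
XML

# Write one file per delt, named after its id.
my %split = split_by_delt($xml);
for my $id ( sort keys %split ) {
    open my $out, '>', "delt_$id.xml" or die "Cannot write delt_$id.xml: $!";
    print {$out} $split{$id};
    close $out;
}
```

This trades speed for simplicity: it parses the input once per delt, so for large files the handler-based approach in @mirod's answer is likely the better fit.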