Error when trying to split an XML file using the XML::LibXML module

Date: 2015-03-25 09:18:35

Tags: xml perl xpath xml-libxml

I have been trying to split XML data using the XML::LibXML module, but it throws this error:

Can't call method "findnodes" without a package or object reference

My input

<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p><text>welcome</text></p>
    </rect>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p><text>welcome1</text></p>
    </rect>
  </bhap>
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.&#x2014;</title>
      <label>3.</label>
      <p><text>welcome3</text></p>
    </rect>
    <rect id=S4">
      <title>Term</title>
      <label>4.</label>
      <p><text>welcome4</text></p>
    </rect>
  </bhap>
</xml>

Desired output

File 1

<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p><text>welcome</text></p>
    </rect>
  </bhap>
</xml>

File 2

<xml>   
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p><text>welcome1</text></p>
    </rect>
  </bhap>
</xml>

File 3

<xml>
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.&#x2014;</title>
      <label>3.</label>
      <p><text>welcome3</text></p>
    </rect>
  </bhap>
</xml>

File 4

<xml>       
  <bhap id="2">
    <label>cylind – II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id=S4">
      <title>Term</title>
      <label>4.</label>
      <p><text>welcome4</text></p>
    </rect>
  </bhap>
</xml>

My code

use XML::LibXML;

my $file   = shift || die "usage $0 <xmlfile>";
my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($file);

my @nodes = $doc->findnodes('//bhap');
foreach my $node1 (@nodes) {

    my $bhap = $node1->toString(), "\n";

    if ( $bhap =~ m/(<bhap.+?>.+?<\/title>)(.+?)(<\/bhap>)/is ) {

        my $bhap1 = $1;
        my $bhap2 = $2;
        my $bhap3 = $3;

        my $nodes1 = $bhap->findnodes('//rect');
        foreach my $node (@$nodes1) {

            my $rect = $node->toString();

            if ( $rect =~ m/(<rect\s*id="(.+?)">.+?<\/rect>)/is ) {

                my $var1 = $1;
                my $var2 = $2;

                print "file" $var2;
                print "<xml>" print $bhap1;
                print $var1;
                print $bhap3;
                print "</xml>";
            }
        }
    }
}

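1 answer:

Answer 0 (score: 2)

OK, so you started off well, but then... fell into the 'regular expression' trap. Parsing XML with regular expressions is a bad idea because it gets too complicated: to do it properly you need to handle and validate tag nesting, line feeds and all sorts of other things that basically just turn your regex into a brittle piece of code. So please don't.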
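The error quoted in the question comes from the same habit: $bhap holds the string returned by toString(), not a node object, so findnodes cannot be called on it. As a minimal sketch (the inline sample data is a trimmed copy of the question's input, and the variable names are only illustrative), staying with node objects and asking them for attributes and child text avoids both the error and the regexes:

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the question's input, inlined so the sketch is self-contained.
my $xml = <<'XML';
<xml>
  <bhap id="1">
    <title>premier</title>
    <rect id="S1"><title>Short</title></rect>
    <rect id="S2"><title>Definite</title></rect>
  </bhap>
</xml>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

my @ids;
foreach my $bhap ( $doc->findnodes('//bhap') ) {

    # $bhap is still a node here, so findnodes can be called on it.
    # './rect' is relative to this bhap; '//rect' would search the whole document.
    foreach my $rect ( $bhap->findnodes('./rect') ) {
        my $id    = $rect->getAttribute('id');    # replaces the id="(.+?)" regex
        my $title = $rect->findvalue('./title');  # text of the child title element
        push @ids, $id;
        print "$id: $title\n";
    }
}
```

Only call toString() at the point where the XML actually has to be written out.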

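But most importantly: always add use strict; and use warnings; before posting a query. These are your first port of call for troubleshooting.

If you do so, you'll see that the following: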

print "file" $var2;

just isn't going to work. There are a few other things in your code that won't really work either, but that would be the starting point.
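For reference, perl rejects that line at compile time because there is no operator between the string and the variable. A small sketch of forms that do compile (the value assigned to $var2 is just a placeholder):

```perl
use strict;
use warnings;

my $var2 = 'S1';    # placeholder value for illustration

# Pass a list to print, separating the arguments with commas ...
print "file", $var2, "\n";

# ... or interpolate the variable into the string ...
print "file$var2\n";

# ... or build the string first with the concatenation operator.
my $line = "file" . $var2;
print $line, "\n";
```

The same applies to the line that places two print statements back to back without a semicolon between them.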

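Also, your XML isn't valid: the id attribute for 'S4' is missing its opening quote, I think.

Anyway, assuming that's just a typo, I'd start with XML::Twig (because I know it better than LibXML, rather than for any specific reason) and do something like this: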

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

my %children_of;

#as we process, extract all the 'rect' elements - along with a reference to their context.
sub process_rect {
    my ( $twig, $rect ) = @_;
    push( @{ $children_of{ $rect->parent } }, $rect->cut );
}


my $twig = XML::Twig->new(
    'pretty_print'  => 'indented',
    'twig_handlers' => { 'rect' => \&process_rect },

);

$twig->parse( \*DATA );

#run through all the 'bhap' elements. 
foreach my $bhap ( $twig->root->children('bhap') ) {
    #find the rect elements under this bhap. 
    foreach my $rect ( @{ $children_of{$bhap} } ) {
        #create a new XML document - copy the 'root' name from your original document. 
        my $xml    = XML::Twig::Elt->new( $twig -> root -> name );
        #duplicate this 'bhap' element by copying it, rather than cutting it,
        #so we can paste it more than once (e.g. per 'rect')
        my $subset = $bhap->copy;
        #insert the 'bhap' into our new xml. 
        $subset->paste( last_child => $xml );
        #insert our cut rect beneath this bhap. 
        $rect->paste( last_child => $subset );

        #print the resulting XML. 
        print "--\n";
        $xml->print;
    }
}

__DATA__
<xml>

<bhap id="1">
                <label>cylind - I</label>
                <title>premier</title>
                <rect id="S1">
                    <title>Short</title>
                    <label>1.</label>
                    <p><text>welcome</text></p>
                </rect>
                <rect id="S2">
                    <title>Definite</title>
                    <label>2.</label>
                    <p><text>welcome1</text></p>
                </rect>
        </bhap>
            <bhap id="2">
                <label>cylind - II</label>
                <title>AUTHORITIES AND ITS EMPLOYEES</title>

                <rect id="S3">
                    <title>nauty.&#x2014;</title>
                    <label>3.</label>
                    <p><text>welcome3</text></p>
                </rect>

                <rect id="S4">
                    <title>Term</title>
                    <label>4.</label>
                    <p><text>welcome4</text></p>
                </rect></bhap>

</xml>

We preprocess the XML and 'cut' the rect nodes out. Then we loop through each bhap node, copying it and inserting the relevant rect beneath the copy.

This gives the output:

--
<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1">
      <title>Short</title>
      <label>1.</label>
      <p>
        <text>welcome</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S2">
      <title>Definite</title>
      <label>2.</label>
      <p>
        <text>welcome1</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="2">
    <label>cylind - II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3">
      <title>nauty.—</title>
      <label>3.</label>
      <p>
        <text>welcome3</text>
      </p>
    </rect>
  </bhap>
</xml>
--
<xml>
  <bhap id="2">
    <label>cylind - II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S4">
      <title>Term</title>
      <label>4.</label>
      <p>
        <text>welcome4</text>
      </p>
    </rect>
  </bhap>
</xml>

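Which at least looks fairly close to what you're trying to produce. I've skipped reading the input file and writing the results to files, because rebuilding the XML is the harder part.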
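If it helps, here is a rough sketch of the skipped output step. It assumes each file is named after the rect id and written to a scratch directory; write_element is a hypothetical helper, not part of XML::Twig, and the small inline document merely stands in for one of the per-rect documents built in the loop above.

```perl
use strict;
use warnings;
use XML::Twig;
use File::Temp qw(tempdir);

# Hypothetical helper: serialize one element to a UTF-8 encoded file.
sub write_element {
    my ( $elt, $path ) = @_;
    open( my $fh, '>:encoding(UTF-8)', $path )
        or die "Cannot open $path for writing: $!";
    print {$fh} $elt->sprint;
    close($fh) or die "Cannot close $path: $!";
    return $path;
}

# Stand-in for one of the per-rect documents.
my $twig = XML::Twig->new( pretty_print => 'indented' );
$twig->parse(
    '<xml><bhap id="1"><title>premier</title>'
  . '<rect id="S1"><title>Short</title></rect></bhap></xml>'
);

my $dir  = tempdir( CLEANUP => 1 );    # scratch directory, removed at exit
my $id   = $twig->root->first_descendant('rect')->att('id');
my $path = write_element( $twig->root, "$dir/$id.xml" );
print "wrote $path\n";
```

In the real script you would call the helper in place of $xml->print and choose the output directory yourself.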

I'd also suggest looking at xml_split, which ships with XML::Twig, as that may do exactly what you want.
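For completeness, the same copy-and-prune idea can probably be expressed with XML::LibXML, the module the question started with. This is only a sketch under stated assumptions: the sample data is a shortened version of the question's input, rect ids are assumed to be unique, and the %split hash is just an illustrative way to collect the results.

```perl
use strict;
use warnings;
use XML::LibXML;

my $source = <<'XML';
<xml>
  <bhap id="1">
    <label>cylind - I</label>
    <title>premier</title>
    <rect id="S1"><title>Short</title></rect>
    <rect id="S2"><title>Definite</title></rect>
  </bhap>
  <bhap id="2">
    <label>cylind - II</label>
    <title>AUTHORITIES AND ITS EMPLOYEES</title>
    <rect id="S3"><title>nauty.</title></rect>
  </bhap>
</xml>
XML

my $doc = XML::LibXML->load_xml( string => $source );

my %split;    # rect id => serialized document holding only that rect
foreach my $bhap ( $doc->findnodes('/xml/bhap') ) {
    foreach my $rect ( $bhap->findnodes('./rect') ) {
        my $id = $rect->getAttribute('id');

        # New document with the same root element name as the source.
        my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
        my $root = $out->createElement( $doc->documentElement->nodeName );
        $out->setDocumentElement($root);

        # importNode makes a deep copy, so the source tree is left untouched.
        my $copy = $out->importNode($bhap);
        $root->appendChild($copy);

        # Drop every rect in the copy except the one being split out.
        foreach my $other ( $copy->findnodes('./rect') ) {
            $copy->removeChild($other)
                unless $other->getAttribute('id') eq $id;
        }

        $split{$id} = $out->toString(1);
    }
}

print "== $_ ==\n$split{$_}" for sort keys %split;
```

Removing the unwanted rect elements leaves their surrounding whitespace behind, so the indentation of the result will not match the pretty-printed XML::Twig output exactly.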