Building a hash tree using XML::DOM in Perl

Time: 2013-01-24 16:21:35

Tags: perl hash xml-parsing xmldom

I want to parse a simple XML document using the XML::DOM module.

<?xml version ="1.0"?>
<Select>
  <book>
    <prop Name = "prop1" Title = "title1" />
    <prop Name = "prop2" Title = "title2" />
  </book>
  <fruit>
    <prop Name = "prop3" Title = "title3" />
    <prop Name = "prop4" Title = "title4" />
  </fruit>
</Select>

The expected output is:

$VAR1 = {
  Select => {
    book  => {
               prop => [
                 { Name => "prop1", Title => "title1" },
                 { Name => "prop2", Title => "title2" },
               ],
             },
    fruit => {
               prop => [
                 { Name => "prop3", Title => "title3" },
                 { Name => "prop4", Title => "title4" },
               ],
             },
  },
}

The code I have written is:

use strict;
use XML::DOM;
use Data::Dumper;

my @stack;
my %hash;
push @stack,\%hash;

my $parser = new XML::DOM::Parser;
my $doc = $parser -> parsefile('demo.xml');
my $root = $doc->getDocumentElement();
my $rootnode = $root->getTagName;

################################################################

foreach my $node ($doc->getElementsByTagName($rootnode)){
    push @stack,$stack[$#stack]->{$rootnode};
    my @childnode = $node->getChildNodes();

    foreach my $child(@childnode){
        if($child->isElementNode){
            my $childname = $child->getNodeName();
            pop(@stack);
            push @stack,$stack[$#stack]->{$rootnode} = {$childname,{}};
            my @childnodes2 = $child->getChildNodes();

            foreach my $subchild(@childnodes2){
                if($subchild->isElementNode){
                    my $subchildname = $subchild->getNodeName();

                    my $name = $subchild->getAttributes->getNamedItem('Name')->getNodeValue;
                    my $title = $subchild->getAttributes->getNamedItem('Title')->getNodeValue;
                    pop(@stack);
                    push @stack,$stack[$#stack]->{$rootnode}->{$child->getNodeName()} = {$subchildname,{}};    #{} contains $name or $title
                }
            }
        }
    }
}

print Dumper(\%hash);

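I suspect I am not pushing to and popping from the stack array correctly. Also, I do not want to use XML::Simple or recursion.

How can I do this in Perl?

1 answer:

Answer 0 (score: 1)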

Here is a possible solution. It assumes that the whole document follows a strict schema: one Select as the root, any number of differently named child nodes (name collisions are not handled), and any number of prop nodes under each of those children, where only the Name and Title attributes are of interest.

Here is the preamble. I also use Carp for better error handling.

#!/usr/bin/perl
use strict; use warnings; use 5.012;
use XML::DOM;
use Data::Dumper;
use Carp;

Here is the main code. It starts a parser (assuming the document is available through the special DATA filehandle, i.e. placed after a __DATA__ marker at the end of the script) and passes the resulting document to the make_data_structure subroutine. I let the script die liberally so that errors are caught early.

{
    my $xml_parser = XML::DOM::Parser->new;
    my $document_string = do{ local $/=undef; <DATA> };
    my $document = $xml_parser->parse($document_string) or die;

    my $data_structure = make_data_structure($document) or die;
    print Dumper $data_structure;
}

Here is the subroutine that does all the work. It takes a document and returns a hashref that conforms to your format.

sub make_data_structure {
    my ($document) = @_;
    my $root = $document->getDocumentElement;
    my $rootname = $root->getTagName // "undef";

    didnt_expect_anything(but=> "Select", as=> "the root tag", got=> $rootname)
        unless $rootname eq "Select";

    my $dsc = +{ $rootname => +{} };
    CHILD:
    for my $child ($root->getChildNodes) {
        next CHILD unless $child->isElementNode;

        my $childname = $child->getTagName
            // couldnt_get("the tag name", of=> "a $rootname child");

        $dsc->{$rootname}{$childname} = undef; # unnecessary if we have props
        PROP:
        for my $prop ($child->getChildNodes) {
            next PROP unless $prop->isElementNode;

            my $propname = $prop->getTagName // "undef";

            # didnt_expect_anything croaks by itself, so no extra die is needed
            didnt_expect_anything(but=> "prop", got=> $propname)
                unless $propname eq "prop";

            my $attributes = $prop->getAttributes
                // couldnt_get("the attributes", of=> "a prop node");

            # for minimum code duplication, and maximum error handling,
            # use dataflow programming, and `map`.
            my ($Name, $Title) =
                map { $_->getNodeValue // couldnt_get("the node value", of=>"the attribute") }
                map { $attributes->getNamedItem($_) // couldnt_get("the named item $_", of=> "the prop attributes") }
                    qw/Name Title/;
            my $propvalue = +{
                Name    => $Name,
                Title   => $Title,
            };

            push @{ $dsc->{$rootname}{$childname}{$propname} }, $propvalue;
        }
    }
    return $dsc;
}

Here are the custom error-handling subroutines that make the code above more expressive.

sub didnt_expect_anything {
    my %args = @_;
    my $expected = $args{but} // croak qq(required named argument "but" missing);
    my $role     = $args{as}  // "a tag name";
    my $instead  = $args{got} // croak qq(required named argument "got" missing);
    croak qq(Didn't expect anything but "$expected" as $role here, got "$instead");
}

sub couldnt_get {
    my ($what, %args) = @_;
    my $of_what = $args{of} // croak qq(required named argument "of" missing);
    croak qq(Couldn't get $what of $of_what);
}

Sure, this produces the correct output, but the proper way to get there would be to use a module from CPAN.

Part of the problem with your implementation (apart from the missing error handling) is that you perform some complicated gymnastics with your "stack".

Before the first iteration of the outer loop, @stack holds +{} (a reference to an empty hash).

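$stack[$#stack]->{$rootnode} accesses the last element of the stack (better written as $stack[-1]), treats the value as a hash reference, and looks up the entry named $rootnode. This evaluates to undef, which is then pushed onto the stack. Chaos ensues.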
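The stack problem can be reproduced without any XML at all. The following is a minimal sketch, not part of the original answer: it assumes a root name of 'Select' as a stand-in for the value returned by getTagName, and the variable names ($top, $child, and so on) are illustrative only. It first shows that reading a missing hash key yields undef without creating the key, then shows one way to keep the stack useful: create the child hash, store it in the parent, and push that same reference.

```perl
use strict;
use warnings;

my %hash;
my @stack    = (\%hash);
my $rootnode = 'Select';    # assumption: stands in for $root->getTagName

# Reading a missing key in rvalue context yields undef and creates nothing,
# so pushing this value would put undef on top of the stack.
my $top          = $stack[-1]->{$rootnode};
my $pushed_undef = !defined $top;
my $key_created  = exists $hash{$rootnode};

# Instead: create the child hash, store it in the parent,
# and push the very same reference onto the stack.
my $child = $stack[-1]{$rootnode} = {};
push @stack, $child;

# Anything written through the top of the stack now lands inside %hash.
push @{ $stack[-1]{book}{prop} }, { Name => 'prop1', Title => 'title1' };

my $name = $hash{Select}{book}{prop}[0]{Name};
print "pushed undef: ", ($pushed_undef ? "yes" : "no"), "\n";
print "name reached through the root hash: $name\n";
```

Because the element on the stack and the value stored in the parent are the same reference, no later assignment is needed to attach the child to the tree, and pop simply returns to the parent level.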