Best way to parse complex XML with Perl?

Time: 2014-07-11 15:25:59

Tags: perl xml-parsing

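I have to parse several XML files with Perl and store the values in a hash. If possible, I would like to filter on certain attributes. Later in my code I pull the data out of the hash and insert it into a database.

I have been using XML::Parser, but I would rather parse into a hash than handle every tag as it is encountered. Any suggestions?

I want to skip any path that has the attribute kind="dir". For each remaining path I need the author, date, message, and file type (the file extension). There can be any number of <path> elements, and each can be of kind "file" or "dir". There can also be multiple <logentry> elements.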

<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <author>cergyl</author>
        <date>2013-07-19T05:31:01.212620Z</date>
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
        <msg>Whitespace change to verify repository synchronization</msg>
    </logentry>
</log>

my $XML_Parser = XML::Parser->new(
                                  Handlers => {
                                                 Start   => \&hdl_xml_tag_start,
                                                 End     => \&hdl_xml_tag_end,
                                                 Char    => \&hdl_xml_nonmarkup_char,
                                                 Default => \&hdl_xml_default
                                               }
                                 );

# This event is generated when an XML start tag is recognized. Parser is an XML::Parser::Expat instance.
sub hdl_xml_tag_start
{
    my ( $parser, $element, %attributes ) = @_;
    $attributes{ '_str' } = "$element:";
    $XML_Attributes_Hash_Ref = \%attributes;
    return;
}

# This event is generated when an XML end tag is recognized. Note that an XML empty tag (<foo/>) generates both a start and an end event.
sub hdl_xml_tag_end
{
    my ( $parser, $element ) = @_;

    #format_message($XML_Attributes_Hash_Ref);
    format_svn_history( $XML_Attributes_Hash_Ref );
    return;
}


# This event is generated when non-markup is recognized. The non-markup sequence of characters is in String.
# A single non-markup sequence of characters may generate multiple calls to this handler.
sub hdl_xml_nonmarkup_char
{
    my ( $parser, $string ) = @_;
    $XML_Attributes_Hash_Ref->{ '_str' } .= $string;
    return;
}

#This is called for any characters that don't have a registered handler.
sub hdl_xml_default { return; }

2 Answers:

Answer 0 (score: 2)

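Since you have provided limited information, it is hard to write a comprehensive solution, but here is some code that uses XML::Twig to process the XML data you have shown. It prints the contents of every path element (there is only one) whose kind attribute is not equal to dir.

XML::LibXML is also a high-quality module, based on the libxml2 library, which is written in C.
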
use strict;
use warnings;

use XML::Twig;

my $parser = XML::Twig->new(
  twig_handlers => {
    path => \&path_handler,
  }
);

$parser->parse(*DATA);

sub path_handler {
  my ($twig, $path) = @_;
  return if $path->att('kind') eq 'dir';
  print $path->text, "\n";
}


__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <author>cergyl</author>
        <date>2013-07-19T05:31:01.212620Z</date>
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
        <msg>Whitespace change to verify repository synchronization</msg>
    </logentry>
</log>

Output:

/team.admin/trunk/file.cpp
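
The handler above only prints the path. As a rough sketch of how the same XML::Twig approach might collect everything the question asks for (author, date, message, and file extension), the handler can be attached to logentry instead of path. The names @records, %common, and the hash keys (revision, ext, and so on) are my own choices for illustration, not anything from the question, and the extension regex is only one assumption about what counts as a file type.

```perl
use strict;
use warnings;

use XML::Twig;

my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <author>cergyl</author>
        <date>2013-07-19T05:31:01.212620Z</date>
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
        <msg>Whitespace change to verify repository synchronization</msg>
    </logentry>
</log>
XML

my @records;    # one hash per non-directory path (hypothetical structure)

my $twig = XML::Twig->new(
    twig_handlers => {
        logentry => sub {
            my ( $t, $entry ) = @_;

            # Fields shared by every path in this log entry
            my %common = (
                revision => $entry->att('revision'),
                author   => $entry->first_child_text('author'),
                date     => $entry->first_child_text('date'),
                msg      => $entry->first_child_text('msg'),
            );

            # descendants() finds <path> both inside <paths> and
            # directly under <logentry>
            for my $path ( $entry->descendants('path') ) {
                next if ( $path->att('kind') // '' ) eq 'dir';

                my $name = $path->text;

                # Assumption: the extension is whatever follows the last
                # dot in the final path component
                my ($ext) = $name =~ /\.([^.\/]+)\z/;

                push @records, { %common, path => $name, ext => $ext // '' };
            }

            $t->purge;    # release memory for entries already handled
        },
    },
);

$twig->parse($xml);

for my $r (@records) {
    print join( "\t", @{$r}{qw(revision author date ext path)} ), "\n";
}
```

Each element of @records is then a plain hash that can be passed to a database insert later. Calling purge after each logentry keeps memory use flat on large logs; drop it if you need the whole tree afterwards.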

Answer 1 (score: 0)

Personally, I like XML::DOM::Parser from XML::DOM, but I use XML::Twig to pretty-print documents.

use XML::DOM;
use XML::Twig;

my $xp = XML::DOM::Parser->new();

# Parse XML from a string
my $doc = $xp->parse("<xml></xml>");
$doc->dispose();

# Parse XML from a file
$doc = $xp->parsefile("file.xml");
$doc->dispose();

# Pretty-print a poorly formatted XML document
my $xpp = XML::Twig->new(pretty_print => 'indented');
$xpp->parse("<xml></xml>");
$xpp->print();
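
For the data in the question, an XML::DOM version might look roughly like the sketch below. It walks the logentry and path elements through the DOM NodeList interface and skips kind="dir". The helper first_text, the @records array, and its hash keys are hypothetical names introduced for this example, and the code assumes each of author, date, and msg holds plain text only.

```perl
use strict;
use warnings;

use XML::DOM;

my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <author>cergyl</author>
        <date>2013-07-19T05:31:01.212620Z</date>
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
        <msg>Whitespace change to verify repository synchronization</msg>
    </logentry>
</log>
XML

# Hypothetical helper: text of the first descendant element named $tag,
# or '' when the element is missing or empty
sub first_text {
    my ( $elem, $tag ) = @_;
    my $nodes = $elem->getElementsByTagName($tag);
    return '' unless $nodes->getLength;
    my $text = $nodes->item(0)->getFirstChild;
    return defined $text ? $text->getData : '';
}

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse($xml);

my @records;    # one hash per non-directory path (hypothetical structure)

my $entries = $doc->getElementsByTagName('logentry');
for my $i ( 0 .. $entries->getLength - 1 ) {
    my $entry = $entries->item($i);

    my $paths = $entry->getElementsByTagName('path');
    for my $j ( 0 .. $paths->getLength - 1 ) {
        my $path = $paths->item($j);
        next if ( $path->getAttribute('kind') // '' ) eq 'dir';

        my $node = $path->getFirstChild;
        my $name = defined $node ? $node->getData : '';

        # Assumption: the extension follows the last dot in the final
        # path component
        my ($ext) = $name =~ /\.([^.\/]+)\z/;

        push @records, {
            author => first_text( $entry, 'author' ),
            date   => first_text( $entry, 'date' ),
            msg    => first_text( $entry, 'msg' ),
            path   => $name,
            ext    => $ext // '',
        };
    }
}

$doc->dispose;    # break circular references so the tree can be freed

print "$_->{author}\t$_->{date}\t$_->{ext}\t$_->{path}\n" for @records;
```

Unlike a streaming handler, this loads the whole document into memory first, which is fine for small logs but worth keeping in mind for a long repository history.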