Finding section levels in an XML structured document - perl

Time: 2015-08-17 09:17:54

Tags: xml perl

Input:

<section>
   <para>...level 1</para>
   <para>...level 1</para>
   <para>...level 1</para>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <section>
         <para>...level 3</para>
         <para>...level 3</para>
         <para>...level 3</para>
      </section>
      <para>...level 2</para>
   </section>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
</section>
<section>
   <para>...level 1</para>
   <para>...level 1</para>
   <para>...level 1</para>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
   <section>
      <para>...level 2</para>
      <para>...level 2</para>
      <para>...level 2</para>
   </section>
</section>

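I need to find every section element and insert a value into its tag name according to its nesting level. The required output is as follows: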

<section1>
<para>...level 1</para>
<para>...level 1</para>
<para>...level 1</para>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
      <section3>
      <para>...level 3</para>
      <para>...level 3</para>
      <para>...level 3</para>
      </section3>
   <para>...level 2</para>
   </section2>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
</section1>
<section1>
<para>...level 1</para>
<para>...level 1</para>
<para>...level 1</para>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
   <section2>
   <para>...level 2</para>
   <para>...level 2</para>
   <para>...level 2</para>
   </section2>
</section1>

First attempt:

foreach my $lines ( @splitCnt ) {

    if ( $lines =~ m/<section\s+/g ) {
        $opn++;
        $lines =~ s/<section\s+/<section$opn /i;
        $cls = $opn;
        $opn++;
    }
    elsif ( $lines =~ m/<\/section>/g ) {
        $opn = $opn - 1;
        $lines =~ s/<\/section>/<\/section$opn>/i;
    }

    $all_lines .= "$lines\n";
}

Second attempt:

my ( $pre1, $match1, $post1 ) = "";

while ( $incnt =~ m/<section\s+[^>]*>/g ) {

    $pre1   = $`;
    $match1 = $&;
    $post1  = $';
    my $Opn = '1';
    my $Cls = "";

    $match1 =~ s/<section\s+/<section$Opn /gi;

    if ( $post1 =~ m/<section\s+/i ) {
        $Opn++;
        $post1 =~ s/<section\s+/<section$Opn /;
        $Opn = $Cls;
    }
    elsif ( $post1 =~ m/<\/section>/i ) {
        $post1 =~ s/<\/section/<\/section$Cls/;
    }

    $pre1 .= $match1;
    $incnt = $post1;

    print "$pre1\n";
    system 'pause';
}

if ( length $pre1 ) {
    $incnt = $pre1 . $post1;
}

Can anyone help with this?

2 answers:

Answer 0: (score: 4)

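Seriously, don't use regular expressions on XML. It's bad news. There are perfectly valid things you can do in XML that will break a regex, so what you end up with is broken XML and brittle code that may fail horribly one day, and nobody will know why.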
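As a minimal sketch of that failure mode (the sample document and variable names here are made up for illustration, not taken from the question), the snippet below feeds well-formed XML to a naive pattern. The text `<section` also appears inside a comment and a CDATA block, so the regex counts tags that are not elements at all.

```perl
use strict;
use warnings;

# Hypothetical sample: well-formed XML where "<section" also shows up
# inside a comment and inside a CDATA block.
my $xml = <<'XML';
<root>
  <!-- <section> is described below -->
  <section>
    <para><![CDATA[a literal <section> tag]]></para>
  </section>
</root>
XML

# Naive count of opening tags, in the spirit of the regex attempts.
# The "= () =" idiom forces list context so we get the number of matches.
my $regex_count = () = $xml =~ /<section\b/g;

# The document contains exactly one real <section> element.
print "regex matched $regex_count times, but there is only 1 section element\n";
```

A parser skips comments and CDATA content when it builds the tree, which is why the parser-based approach does not have this problem.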

Use a parser. Personally, I like XML::Twig.

With it you can quite easily walk the tags and rename them:

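#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

# Rename a section to "section<depth>", then recurse into its direct
# child sections, which sit one level deeper.
sub process_section {
    my ( $section, $depth ) = @_;
    $depth++;
    $section->set_tag("section$depth");
    foreach my $subsection ( $section->children('section') ) {
        process_section( $subsection, $depth );
    }
}

# The file must be well-formed, i.e. have a single root element
# wrapping the top-level sections.
my $twig = XML::Twig->new( pretty_print => 'indented_a' );
$twig->parsefile('your_file.xml');

# Start from the section elements found relative to the document root.
foreach my $section ( $twig->findnodes('section') ) {
    process_section( $section, 0 );
}

$twig->print;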

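I'd also point out that your question sounds like an XY problem. What are you actually trying to achieve? Changing tag names based on their position in the hierarchy is usually undesirable, because afterwards... well, afterwards you can no longer do what I just did: walk the data structure recursively.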
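One possible alternative, sketched here under assumptions rather than offered as what the asker needs: keep every tag named `section` and record the depth in an attribute instead. The attribute name `level`, the sample XML, and the helper `annotate` are all hypothetical choices for this illustration. Because the tag names stay uniform, the same recursive walk still works on the output.

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample input with a single root element.
my $xml = <<'XML';
<root>
  <section>
    <para>...level 1</para>
    <section>
      <para>...level 2</para>
    </section>
  </section>
</root>
XML

my $twig = XML::Twig->new( pretty_print => 'indented' );
$twig->parse($xml);

# Store the depth in a "level" attribute (an assumed name) and recurse
# into the direct child sections, leaving the tag name untouched.
sub annotate {
    my ( $section, $depth ) = @_;
    $section->set_att( level => $depth );
    annotate( $_, $depth + 1 ) for $section->children('section');
}

annotate( $_, 1 ) for $twig->root->children('section');

$twig->print;
```

Whether an attribute suits the downstream consumer depends on the real goal, which the question does not state.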

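Answer 1: (score: 2)

Here is a variant that uses the XML::LibXML module. It simply finds all the section elements and works out each one's level in the hierarchy by counting the slashes in its XPath node path.

However, as others have said, this is a strange thing to want to do, and it sounds a lot like a solution to a different problem. If you explain the full problem, we can help you better.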

use strict;
use warnings;

use XML::LibXML;

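# Parse the XML that follows __DATA__ at the end of this script.
my $doc = XML::LibXML->load_xml(IO => \*DATA);

for my $section ( $doc->findnodes('//section') ) {
    # nodePath returns a path such as /root/section[1]/section[2].
    # tr|/|| counts its slashes; subtracting 1 discounts the <root> step.
    my $n = ( $section->nodePath =~ tr|/|| ) - 1;
    $section->setNodeName("section$n");
}

print $doc;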

__DATA__
<root>
    <section>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <section>
                <para>...level 3</para>
                <para>...level 3</para>
                <para>...level 3</para>
            </section>
            <para>...level 2</para>
        </section>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
    </section>
    <section>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
        <section>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section>
    </section>
</root>

Output

<?xml version="1.0"?>
<root>
    <section1>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <section3>
                <para>...level 3</para>
                <para>...level 3</para>
                <para>...level 3</para>
            </section3>
            <para>...level 2</para>
        </section2>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
    </section1>
    <section1>
        <para>...level 1</para>
        <para>...level 1</para>
        <para>...level 1</para>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
        <section2>
            <para>...level 2</para>
            <para>...level 2</para>
            <para>...level 2</para>
        </section2>
    </section1>
</root>
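To see the slash-counting step on its own, here is a small sketch that needs no XML module. The path strings are hypothetical examples written in the style of what `nodePath` produces for a document wrapped in `<root>`; `%level_for` is just an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical node paths for sections nested one, two and three deep.
my @paths = (
    '/root/section[1]',
    '/root/section[1]/section[2]',
    '/root/section[1]/section[1]/section',
);

my %level_for;
for my $path (@paths) {
    # tr with an empty replacement list only counts the matching
    # characters; subtracting 1 discounts the slash before "root".
    $level_for{$path} = ( $path =~ tr{/}{} ) - 1;
    print "$path => section$level_for{$path}\n";
}
```

The count depends on how many wrapper elements sit above the outermost section, so a document with a deeper wrapper would need a different offset than 1.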