Perl regex to replace nested blocks

Date: 2014-03-13 17:42:59

Tags: regex perl html-parsing

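I need to get the nested blocks into an array of hashes or a hash tree so that I can replace the blocks with dynamic content. I need to replace the code between

<!--block:XXX-->

and the first closing end block

<!--endblock-->

with my dynamic content.

I have this code, which finds comment blocks one level deep but does not handle nesting: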

#<!--block:listing-->... html code block here ...<!--endblock-->
$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;

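This is the full nested HTML template I want to process. I need to find the inner block "block:third" and replace it with my content, then find "block:second" and replace it, then find the outer block "block:first" and replace it. Note that there can be any number of nested blocks, not just the three shown in the example below.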

use Data::Dumper;

$content=<<HTML;
some html content here

<!--block:first-->
    some html content here

    <!--block:second-->
        some html content here

        <!--block:third-->
            some html content here
        <!--endblock-->

        some html content here
    <!--endblock-->

    some html content here
<!--endblock-->
HTML

$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;
print Dumper(\%blocks);

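That way I can access and modify the blocks, e.g. $blocks{first} = "my content here", $blocks{second} = "another content here", and so on, and then replace the blocks.

I created this regex

4 Answers:

Answer 0 (score: 2)

Update

This is in response to your "merging" into a single regex...

It seems you do not care about the order needed to reconstruct the HTML. So if you only want to isolate the content of each sub-section, the code below is all you need. However, you would need lists ([]) to rebuild the order of the embedded sub-sections.

After refreshing my memory on this question, note that the regex used below is the one you should be using.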

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my $href = {};

ParseCore( $href, $content );

#print Dumper($href);

print "\nBase======================\n";
print $href->{content};
print "\nFirst======================\n";
print $href->{first}->{content};
print "\nSecond======================\n";
print $href->{first}->{second}->{content};
print "\nThird======================\n";
print $href->{first}->{second}->{third}->{content};
print "\nFourth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{content};
print "\nFifth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content};

exit;

sub ParseCore
{
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(?is)(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--block:.*?-->).)+))/g )
    {
       if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $1;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
        }
        else
        {
           $aref->{content} .= $4;
        }
    }
    return $k;
}

#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

Output >>

Base======================
some html content here top base

some html content here1-5 bottom base

some html content here 6-8 top base

some html content here 6-8 bottom base
First======================

    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top

    some html content here 1 bottom

Second======================

        some html content here 2 top

        some html content here 2 bottom

Third======================

            some html content here 3 top

            some html content here 3a
            some html content here 3b

Fourth======================

                some html content here 4 top


Fifth======================

                    some html content here 5a
                    some html content here 5b

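You can use regex recursion to match the outer nesting, then parse the inner cores with a simple recursive function call.

That way you can also parse the content at whatever nesting level you are on, and build a nested structure along the way that you can use later for template substitution.

Then you can reconstruct the HTML. The only tricky part is traversing the array. But if you know how to traverse its elements (scalars, array/hash refs, etc.), it should be no problem.

Here is a sample.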

    # (?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)

    (?is)                         # Modifiers: Case insensitive, Dot-all
    <!--block:                    # Begin BLOCK
    ( .*? )                       # (1), block name
    -->

    (                             # (2 start), Begin Core
         (?:
              (?:
                   (?!
                        <!--
                        (?: .*? )
                        -->
                   )
                   . 
              )
           |  (?R) 
         )*?
    )                             # (2 end), End Core

    <!--endblock-->               # End BLOCK
 |  
    (                             # (3 start), Or grab content within this core
         (?:
              (?! <!-- .*? --> )
              . 
         )+
    )                             # (3 end)
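To see the recursion in isolation, here is a reduced sketch. It is not the pattern from this answer: it swaps the whole-pattern (?R) for a named group, (?&block), and the names block, name, body and $sample are made up for illustration. Its lookahead only guards against block markers, so other HTML comments are treated as plain text.

```perl
use strict;
use warnings;

# Illustrative pattern (hypothetical names): one block that re-enters
# itself through the named group "block" for every nested block.
my $block_re = qr{
    (?<block>
        <!--block:(?<name>.*?)-->
        (?<body>
            (?:
                (?! <!--block: | <!--endblock--> ) .   # a char that does not start a marker
              |
                (?&block)                              # a complete nested block
            )*
        )
        <!--endblock-->
    )
}xs;

my $sample = 'a<!--block:x-->1<!--hello--><!--block:y-->2<!--endblock-->3<!--endblock-->b';

my ($name, $body);
if ($sample =~ $block_re) {
    # Captures reflect the outermost match, not the recursive calls.
    ($name, $body) = ($+{name}, $+{body});
    print "name: $name\n";
    print "body: $body\n";
}
```

The body still contains the inner block verbatim, which is why the answer passes it back into a recursive function call.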

Perl test case

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my %blocks = ();
$blocks{'base'} = [];


ParseCore( $blocks{'base'}, $content );


sub ParseCore
{
    my ($aref, $core) = @_;
    while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g )
    {
        if ( defined $1 )
        {
           my $branch = {};
           push @{$aref}, $branch;
           $branch->{$1} = [];
           ParseCore( $branch->{$1}, $2 );
        }
        elsif ( defined $3 )
        {
           push @{$aref}, $3;
        }
    }

}

print Dumper(\%blocks);

__DATA__

some html content here top base
<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

Output >>

$VAR1 = {
          'base' => [
                      '
some html content here top base
',
                      {
                        'first' => [
                                     '
    some html content here 1 top
    ',
                                     {
                                       'second' => [
                                                     '
        some html content here 2 top
        ',
                                                     {
                                                       'third' => [
                                                                    '
            some html content here 3a
            some html content here 3b
        '
                                                                  ]
                                                     },
                                                     '
        some html content here 2 bottom
    '
                                                   ]
                                     },
                                     '
    some html content here 1 bottom
'
                                   ]
                      },
                      '
some html content here bottom base
'
                    ]
        };
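Rebuilding the HTML from a structure like the one dumped above could look roughly like this. It is a minimal sketch: render_tree and the %$replace argument are hypothetical names that do not appear in the answer, and the tree is hand-built in the same shape (strings for text, single-key hashes for blocks) rather than produced by ParseCore.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the nested list and rebuild the text,
# substituting any block whose name appears in %$replace.
sub render_tree {
    my ($nodes, $replace) = @_;
    my $out = '';
    for my $node (@$nodes) {
        if (ref($node) eq 'HASH') {
            # Each hash holds exactly one key: the block name.
            my ($name) = keys %$node;
            $out .= exists $replace->{$name}
                  ? $replace->{$name}
                  : render_tree($node->{$name}, $replace);
        }
        else {
            $out .= $node;    # plain text between block markers
        }
    }
    return $out;
}

# Hand-built tree in the same shape as the Dumper output.
my $tree = [
    "top\n",
    { first => [ "1 top\n", { second => [ "2 body\n" ] }, "1 bottom\n" ] },
    "bottom\n",
];

my $rendered = render_tree($tree, { second => "REPLACED\n" });
print $rendered;
```

Because the lists keep text and blocks in document order, the output order falls out of the traversal with no extra bookkeeping.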

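Answer 1 (score: 1)

This builds on @sln's answer above. Although using Perl template or parser modules was recommended, I am confident that none of those modules handles this problem directly.

Here is the solution I came up with.

First, I use a simple regex to find the outer blocks in the whole content or template: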

/(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis

Then I parse each outer block to find its nested sub-blocks, based on @sln's answer above.

/(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx

After that everything works well. I tested with two outer blocks, each containing nested blocks.

I can reach any sub-block like this:

print $blocks->{first}->{content};

print $blocks->{first}->{match};

print $blocks->{first}->{second}->{third}->{fourth}->{content};

Each block hash reference has these keys:

`content`: the block content without the block name and endblock tags.
`match`: the block content with the block name and endblock tags, good for replacing.
`#next`: has the sub block name if exists, good to check if block has children and access them.
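As a rough illustration of how those three keys might be used, the sketch below follows '#next' down a chain of blocks and uses 'match' to substitute a block in the template. The hash is hand-built in the documented shape, and block_chain, $template and the "[dynamic]" text are hypothetical names introduced only for this example.

```perl
use strict;
use warnings;

my $template = 'A<!--block:first-->B<!--block:second-->C<!--endblock-->D<!--endblock-->E';

# Hand-built stand-in for the structure described above.
my $blocks = {
    first => {
        content => 'B<!--block:second-->C<!--endblock-->D',
        match   => '<!--block:first-->B<!--block:second-->C<!--endblock-->D<!--endblock-->',
        '#next' => 'second',
        second  => {
            content => 'C',
            match   => '<!--block:second-->C<!--endblock-->',
        },
    },
};

# Follow '#next' from a block down to its deepest descendant.
sub block_chain {
    my ($node, $name) = @_;
    my @names = ($name);
    while (exists $node->{'#next'}) {
        $name = $node->{'#next'};
        $node = $node->{$name};
        push @names, $name;
    }
    return @names;
}

my @chain = block_chain($blocks->{first}, 'first');
print join(' -> ', @chain), "\n";

# Replace the inner block via its 'match' text; \Q...\E quotes metacharacters.
my $inner = $blocks->{first}{second}{match};
(my $result = $template) =~ s/\Q$inner\E/[dynamic]/;
print "$result\n";
```

Replacing the innermost block first keeps the outer 'match' strings usable only if they are recomputed afterwards, since they embed the original inner text.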

Here is the final, tested, working Perl code.

use Data::Dumper;

$/ = undef;
my $content = <DATA>;

my $blocks = parse_blocks($content);

print Dumper($blocks);

#print join "\n", keys( %{$blocks} ); # root block names
#print join "\n", keys( %{$blocks->{first}} ); # keys of block "first"
#print join "\n", keys( %{$blocks->{first}->{second}});

#print Dumper $blocks->{first};
#print Dumper $blocks->{first}->{content};
#print Dumper $blocks->{first}->{match};

# check if fourth block has sub block.
#print exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}, "\n";

# check if block has sub block, get it:
#if (exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}) {
#   print $blocks->{first}->{second}->{third}->{fourth}->{ $blocks->{first}->{second}->{third}->{fourth}->{'#next'} }->{content}, "\n";
#}

exit;
#================================================
sub parse_blocks {
    my ($content) = @_;
    my $href = {};
    # find outer blocks only
    while ($content =~ /(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis) {
        # parse each outer block nested blocks
        parse_nest_blocks($href, $1);
    }
    return $href;
}
#================================================
sub parse_nest_blocks {
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx )
    {
        if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
           $aref->{$k}->{content} = $v;
           $aref->{$k}->{match} = $1;
           #print "1:{{$k}}\n2:[[$v]]\n";
           my $curraref = $aref->{$k};
           my $ret = parse_nest_blocks($aref->{$k}, $v);
           if ($ret) {
               $curraref->{'#next'} = $ret;
           }
           return $k;
        }
    }

}
#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

And the output of the whole hash dump is:

$VAR1 = {
          'first' => {
                       'second' => {
                                     'third' => {
                                                  'match' => '<!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->',
                                                  'content' => '
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        ',
                                                  'fourth' => {
                                                                'fifth' => {
                                                                             'match' => '<!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->',
                                                                             'content' => '
                    some html content here 5a
                    some html content here 5b
                '
                                                                           },
                                                                'match' => '<!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->',
                                                                'content' => '
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            ',
                                                                '#next' => 'fifth'
                                                              },
                                                  '#next' => 'fourth'
                                                },
                                     'match' => '<!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->',
                                     'content' => '
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    ',
                                     '#next' => 'third'
                                   },
                       'match' => '<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->',
                       'content' => '
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
',
                       '#next' => 'second'
                     },
          'six' => {
                     'match' => '<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->',
                     'content' => '
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
',
                     'seven' => {
                                  'match' => '<!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->',
                                  'content' => '
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    ',
                                  'eight' => {
                                               'match' => '<!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->',
                                               'content' => '
            some html content here 8a
            some html content here 8b
        '
                                             },
                                  '#next' => 'eight'
                                },
                     '#next' => 'seven'
                   }
        };

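Answer 2 (score: 1)

I have to repeat, for you and for anyone else who may find this thread: do not use regexes in such a convoluted way.

I love regular expressions, but they were not designed for this kind of problem. You would be 1000 times better off using a standard templating system like Template::Toolkit.

The problem with using regexes in this context is the tendency to couple parsing with validation. When that happens, the regex ends up very fragile, and people commonly skip validating their data altogether. For example, when a recursive regex sees ((( )), it claims there are only 2 levels of parentheses. In reality there are 2 and 1/2, and that 1/2 is an error that goes unreported.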
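A small sketch of that parenthesis case, under the assumption of a typical recursive pattern (the $balanced regex and the depth counter are illustrative, not taken from any answer here): the regex quietly matches the balanced part, while a plain counter reports the stray opener.

```perl
use strict;
use warnings;

my $input = '((( ))';

# Recursive pattern for one balanced group; (?1) re-enters capture group 1.
my $balanced = qr/ ( \( (?: [^()]++ | (?1) )* \) ) /x;

# The match succeeds by skipping the first "(" without complaint.
my ($matched) = $input =~ $balanced;
my $offset = $-[0];
print "regex matched '$matched' starting at offset $offset\n";

# A plain depth counter keeps validation separate and reports the problem.
my $depth = 0;
my $error;
for my $ch (split //, $input) {
    $depth++ if $ch eq '(';
    $depth-- if $ch eq ')';
    if ($depth < 0) { $error = 'unexpected )'; last; }
}
$error //= "$depth unclosed (" if $depth > 0;
print "validation: ", ($error // 'ok'), "\n";
```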

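I have already described how to avoid this pitfall of regex parsing in my answers to two of your other questions.

Basically, keep your parsing regex as simple as possible. This serves multiple purposes. It keeps the regex less fragile, and it discourages putting validation into the parsing phase.

I showed you how to get started on this particular stackoverflow problem in the second of those solutions. Basically, tokenize your data, then translate the results into a more complicated data structure. I had some free time today, so I decided to fully demonstrate how easily that translation can be done: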

use strict;
use warnings;

use Data::Dump qw(dump dd);

my $content = do {local $/; <DATA>};

# Tokenize Content
my @tokens = split m{<!--(?:block:(.*?)|(endblock))-->}, $content;

# Resulting Data Structure
my @data = (
    shift @tokens, # First element of split is always HTML
);

# Keep track of levels of content
# - This is a throwaway data structure to facilitate the building of nested content
my @levels = ( \@data );

while (@tokens) {
    # Tokens come in groups of 3.  Two capture groups in split delimiter, followed by html.
    my ($block, $endblock, $html) = splice @tokens, 0, 3;

    # Start of Block - Go up to new level
    if (defined $block) {
        #debug# print +('  ' x @levels) ."<$block>\n";
        my $hash = {
            block    => $block,
            content  => [],
        };
        push @{$levels[-1]}, $hash;
        push @levels, $hash->{content};

    # End of Block - Go down level
    } elsif (defined $endblock) {
        die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
        pop @levels;
        #debug# print +('  ' x @levels) . "</$levels[-1][-1]{block}>\n";
    }

    # Append HTML content
    push @{$levels[-1]}, $html;
}
die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

dd @data;

__DATA__

some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

If you uncomment the debug statements, you will see the following traversal of the tokens as the structure is built:

  <first>
    <second>
      <third>
        <fourth>
          <fifth>
          </fifth>
        </fourth>
      </third>
    </second>
  </first>
  <six>
    <seven>
      <eight>
      </eight>
    </seven>
  </six>

The full resulting data structure is:

(
    "\nsome html content here top base\n",
    {
        block   => "first",
        content => [
            "\n    <table border=\"1\" style=\"color:red;\">\n    <tr class=\"lines\">\n        <td align=\"left\" valign=\"<--valign-->\">\n    <b>bold</b><a href=\"http://www.mewsoft.com\">mewsoft</a>\n    <!--hello--> <--again--><!--world-->\n    some html content here 1 top\n    ",
            {
                block   => "second",
                content => [
                    "\n        some html content here 2 top\n        ",
                    {
                        block   => "third",
                        content => [
                            "\n            some html content here 3 top\n            ",
                            {
                                block   => "fourth",
                                content => [
                                    "\n                some html content here 4 top\n                ",
                                    {
                                        block   => "fifth",
                                        content => [
                                            "\n                    some html content here 5a\n                    some html content here 5b\n                ",
                                        ],
                                    },
                                    "\n            ",
                                ],
                            },
                            "\n            some html content here 3a\n            some html content here 3b\n        ",
                        ],
                    },
                    "\n        some html content here 2 bottom\n    ",
                ],
            },
            "\n    some html content here 1 bottom\n",
        ],
    },
    "\nsome html content here1-5 bottom base\n\nsome html content here 6-8 top base\n",
    {
        block   => "six",
        content => [
            "\n    some html content here 6 top\n    ",
            {
                block   => "seven",
                content => [
                    "\n        some html content here 7 top\n        ",
                    {
                        block   => "eight",
                        content => [
                            "\n            some html content here 8a\n            some html content here 8b\n        ",
                        ],
                    },
                    "\n        some html content here 7 bottom\n    ",
                ],
            },
            "\n    some html content here 6 bottom\n",
        ],
    },
    "\nsome html content here 6-8 bottom base",
);
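The "groups of 3" comment in the code relies on how split treats capture groups in its delimiter: every delimiter contributes one field per group, with undef for a group that did not take part. A tiny sketch on a made-up one-block string ($text is illustrative) shows the resulting token list.

```perl
use strict;
use warnings;

my $text = 'a<!--block:x-->b<!--endblock-->c';

# Same delimiter as in the answer: group 1 is the block name,
# group 2 is the literal "endblock".
my @tokens = split m{<!--(?:block:(.*?)|(endblock))-->}, $text;

# Leading text first, then (name, endblock, text) for each marker.
print scalar(@tokens), " tokens\n";
print join(', ', map { defined $_ ? "'$_'" : 'undef' } @tokens), "\n";
```

For this input the list is 'a', 'x', undef, 'b', undef, 'endblock', 'c': one leading text field followed by two triples.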

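Now, why is this approach better?

It is less fragile. You already saw how your previous regex broke when there were other HTML comments in the content. The tools used for parsing here are extremely simple, so there is much less risk of the regex hiding edge cases.

It is also very easy to add functionality to this code. If you want to include parameters in your blocks, you can do it exactly the same way demonstrated in my solution to your earlier problem. The parsing and validation logic would not even need to change.

It reports errors. Remove a character from 'endblock' or 'block' and see what happens. It gives you an explicit error message: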

Error: Unmatched start block: first at h.pl line 43

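Your recursive regex would simply hide the fact that there was an unmatched block in your content. You would of course notice it in your browser when running the code, but this way the error is reported immediately and you can track it down.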
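If only the check is wanted, without building the structure, the same idea fits in a few lines. This is a sketch under that assumption: check_blocks is a hypothetical helper, not part of the code above, and it returns an error string or undef.

```perl
use strict;
use warnings;

# Hypothetical validator: scan the markers only and return an error
# string, or undef when every block is closed.
sub check_blocks {
    my ($text) = @_;
    my @open;    # names of blocks still waiting for their endblock
    while ($text =~ m{<!--(?:block:(.*?)|endblock)-->}g) {
        if (defined $1) {
            push @open, $1;
        }
        elsif (@open) {
            pop @open;
        }
        else {
            return 'unmatched endblock at offset ' . $-[0];
        }
    }
    return @open ? "unmatched start block: $open[-1]" : undef;
}

for my $sample (
    '<!--block:a-->x<!--endblock-->',    # balanced
    '<!--block:a-->x<!--endblock>',      # damaged end marker
    'x<!--endblock-->',                  # end marker with no start
) {
    printf "%-32s => %s\n", $sample, check_blocks($sample) // 'ok';
}
```

A damaged end marker is no longer recognised as a marker at all, so it surfaces as an unmatched start block, much like the error message shown above.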

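Summary:

Finally, I will say again that the best way to solve this problem is not to try to create your own templating system, but to use an existing framework such as Template::Toolkit. You commented earlier that one of your motivations is wanting to use a design editor for your templates, which is why you want them to use HTML comments for the template markup. However, existing frameworks have ways to satisfy that desire too.

In any case, I hope you can learn something from this code. Recursive regexes are cool tools and are great for validating data. But they should not be used for parsing, and hopefully anyone searching for how to use recursive regexes will pause and perhaps rethink their approach if that is what they want them for.

Answer 3 (score: 1)

I am adding one more answer. It is in line with my previous one but a little more complete, and I do not want to keep extending that answer.

This is for @daliaessam, and is a specific response to @Miller's anecdote about recursive parsing using regex.

There are only 3 parts to consider. So, using my previous approach, I lay out a template for how to do this. It is not as hard as you might think.

Cheers!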

 # //////////////////////////////////////////////////////
 # // The General Guide to 3-Part Recursive Parsing
 # // ----------------------------------------------
 # // Part 1. CONTENT
 # // Part 2. CORE
 # // Part 3. ERRORS

 (?is)

 (?:
      (                                  # (1), Take off CONTENT
           (?&content) 
      )
   |                                   # OR
      (?>                                # Start-Delimiter (in this case, must be atomic because of .*?)
           <!--block:
           ( .*? )                            # (2), Block name
           -->
      )
      (                                  # (3), Take off The CORE
           (?&core) 
        |  
      )
      <!--endblock-->                    # End-Delimiter

   |                                   # OR
      (                                  # (4), Take off Unbalanced (delimiter) ERRORS
           <!--
           (?: block: .*? | endblock )
           -->
      )
 )

 # ///////////////////////
 # // Subroutines
 # // ---------------

 (?(DEFINE)

      # core
      (?<core>
           (?>
                (?&content) 
             |  
                (?> <!--block: .*? --> )
                # recurse core
                (?:
                     (?&core) 
                  |  
                )
                <!--endblock-->
           )+
      )

      # content 
      (?<content>
           (?>
                (?!
                     <!--
                     (?: block: .*? | endblock )
                     -->
                )
                . 
           )+
      )

 )

Perl code:

use strict;
use warnings;

use Data::Dumper;

$/ = undef;
my $content = <DATA>;

# Set the error mode on/off here ..
my $BailOnError = 1;
my $IsError = 0;

my $href = {};

ParseCore( $href, $content );

#print Dumper($href);

print "\n\n";
print "\nBase======================\n";
print $href->{content};
print "\nFirst======================\n";
print $href->{first}->{content};
print "\nSecond======================\n";
print $href->{first}->{second}->{content};
print "\nThird======================\n";
print $href->{first}->{second}->{third}->{content};
print "\nFourth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{content};
print "\nFifth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content};
print "\nSix======================\n";
print $href->{six}->{content};
print "\nSeven======================\n";
print $href->{six}->{seven}->{content};
print "\nEight======================\n";
print $href->{six}->{seven}->{eight}->{content};

exit;


sub ParseCore
{
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g )
    {
       if (defined $1)
       {
         # CONTENT
           $aref->{content} .= $1;
       }
       elsif (defined $2)
       {
         # CORE
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $&;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if ( $BailOnError && $IsError ) {
               last;
           }
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
       }
       else
       {
         # ERRORS
           print "Unbalanced '$4' at position = ", $-[0];
           $IsError = 1;

           # Decide to continue here ..
           # If BailOnError is set, just unwind recursion. 
           # -------------------------------------------------
           if ( $BailOnError ) {
              last;
           }
       }
    }
    return $k;
}

#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

Output >>

Base======================
some html content here top base

some html content here1-5 bottom base

some html content here 6-8 top base

some html content here 6-8 bottom base

First======================

    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top

    some html content here 1 bottom

Second======================

        some html content here 2 top

        some html content here 2 bottom

Third======================

            some html content here 3 top

            some html content here 3a
            some html content here 3b

Fourth======================

                some html content here 4 top


Fifth======================

                    some html content here 5a
                    some html content here 5b

Six======================

    some html content here 6 top

    some html content here 6 bottom

Seven======================

        some html content here 7 top

        some html content here 7 bottom

Eight======================

            some html content here 8a
            some html content here 8b