I have this block of HTML:
some html content here top base
<!--block:first-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
I came up with this regular expression to match the nested blocks:
/(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g
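This works fine, but it breaks if any block's content contains a comment.
The following fails because of the <!--comment--> in the first match, although the remaining matches work fine: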
<!--block:first-->
some html content here 1 top
this <!--comment--> will make it fail here.
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
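One possible reading of the failure: the negative lookahead `(?!<!--(?:.*?)-->)` refuses to step over any comment at all, so an ordinary `<!--comment-->` blocks the body of the enclosing block just as a block marker would. The sketch below is only a variation to experiment with, not a confirmed fix: it narrows both lookaheads to the two marker forms. The sample string and the `@found` array are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: one ordinary comment inside a nested block.
my $sample = 'top <!--block:first-->a <!--comment--> b'
           . '<!--block:second-->c<!--endblock-->d<!--endblock--> bottom';

my @found;
# Same shape as the original pattern, but the lookaheads reject only
# <!--block:NAME--> and <!--endblock-->, not every comment.
while ( $sample =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:block:.*?|endblock)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--(?:block:.*?|endblock)-->).)+)/g ) {
    if ( defined $1 ) {
        push @found, "block $1: $2";
    }
    elsif ( defined $3 ) {
        push @found, "text: $3";
    }
}
print "$_\n" for @found;
```

With this change the comment is treated as ordinary block content, while nested blocks are still handled by the `(?R)` recursion.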
This is a follow-up to this question.
Perl test code below:
use Data::Dumper;
$/ = undef;
my $content = <DATA>;
my %blocks = ();
$blocks{'base'} = [];
ParseCore( $blocks{'base'}, $content );
sub ParseCore
{
my ($aref, $core) = @_;
while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g )
{
if ( defined $1 )
{
my $branch = {};
push @{$aref}, $branch;
$branch->{$1} = [];
ParseCore( $branch->{$1}, $2 );
}
elsif ( defined $3 )
{
push @{$aref}, $3;
}
}
}
print Dumper(\%blocks);
__DATA__
some html content here top base
<!--block:first-->
some html content here 1 top
this <!--comment--> will make it fail here.
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
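Answer 0 (score: 4)
I know you must be tired of hearing this, but you're doing it wrong.
I love regular expressions, but they were not designed for this kind of problem. You would be 1000 times better off using a standard templating system such as Template::Toolkit.
If you insist on this approach, then I suggest using a simpler tool. Instead of trying to make one regular expression enforce all the rules, use the most basic regex possible. In this case, I suggest split: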
use strict;
use warnings;
my $content = do {local $/; <DATA>};
my @tokens = split /(<!--(?:block:.*?|endblock)-->)/, $content;
use Data::Dump;
dd \@tokens;
__DATA__
some html content here top base
<!--block:first-->
some html content here 1 top
this <!--comment--> will make it fail here.
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
Output:
[
"\nsome html content here top base\n",
"<!--block:first-->",
"\n some html content here 1 top\n this <!--comment--> will make it fail here.\n ",
"<!--block:second-->",
"\n some html content here 2 top\n ",
"<!--block:third-->",
"\n some html content here 3a\n some html content here 3b\n ",
"<!--endblock-->",
"\n some html content here 2 bottom\n ",
"<!--endblock-->",
"\n some html content here 1 bottom\n",
"<!--endblock-->",
"\nsome html content here bottom base",
]
As you can see, the array alternates between plain text and the markers your pattern matched.
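The alternation comes from the capturing parentheses in the split pattern. A minimal illustration, using made-up one-line input and the hypothetical variable names `@plain` and `@kept`:

```perl
use strict;
use warnings;

# Without capturing parentheses, the separators are discarded.
my @plain = split /<!--endblock-->/, 'a<!--endblock-->b';

# With capturing parentheses, each separator comes back as its own element.
my @kept = split /(<!--endblock-->)/, 'a<!--endblock-->b';

print scalar(@plain), " elements vs ", scalar(@kept), " elements\n";
```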
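Now, I don't know what your end goal is, or what data format you ultimately want, so I can't suggest much from here. However, if it actually meets your needs, you can quite easily recreate your original data structure from these tokens. Better yet, you can do real error checking and look for blocks that lack a matching open or close, which is something your original regex would hide.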
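As a rough sketch of that idea, the tokens can be folded into a tree with a stack. The helper name `build_tree` and the `{ name => ..., children => [...] }` node layout are assumptions made for this example, not the asker's format:

```perl
use strict;
use warnings;

# build_tree is a hypothetical helper. It returns an array reference whose
# elements are either text strings or { name => ..., children => [...] } hashes.
sub build_tree {
    my ($content) = @_;
    my @tokens = split /(<!--(?:block:.*?|endblock)-->)/, $content;

    my $root  = [];
    my @stack = ($root);   # the children list of the innermost open block is last
    my @names;             # names of the blocks that are currently open

    for my $token (@tokens) {
        next if $token eq '';
        if ( $token =~ /\A<!--block:(.*?)-->\z/ ) {
            # Opening marker: attach a new node and descend into it.
            my $node = { name => $1, children => [] };
            push @{ $stack[-1] }, $node;
            push @stack, $node->{children};
            push @names, $node->{name};
        }
        elsif ( $token eq '<!--endblock-->' ) {
            # Closing marker: it must pair with an open block.
            die "endblock without a matching block\n" unless @names;
            pop @stack;
            pop @names;
        }
        else {
            # Plain text, including ordinary comments.
            push @{ $stack[-1] }, $token;
        }
    }
    die "block '$names[-1]' is never closed\n" if @names;
    return $root;
}

my $tree = build_tree(
    'top<!--block:first-->a <!--comment--> b<!--block:second-->c<!--endblock-->d<!--endblock-->bottom'
);
print scalar(@$tree), " top-level items\n";
```

Because the markers are paired explicitly, an unbalanced template raises an error instead of silently producing a partial match.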
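Answer 1 (score: 0)
Despite the advice to use a template or parser module, I can assure you that none of those modules handles this directly.
Here is the solution I came up with.
First, I use a simple regular expression to find the outer blocks in the whole content or template: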
/(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis
Then I parse each outer block to find its nested sub-blocks:
/(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx
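With that, everything works well. I tested it with two outer blocks, each containing nested blocks.
I can reach any sub-block like this:
print $blocks->{first}->{content};
print $blocks->{first}->{match};
print $blocks->{first}->{second}->{third}->{fourth}->{content};
Each block hash reference has these keys: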
`content`: the block content without the opening block-name tag and the endblock tag.
`match`: the block content including the opening block-name tag and the endblock tag, useful for replacing.
`#next`: the name of the sub-block, if one exists; useful for checking whether a block has a child and for accessing it.
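To illustrate how the `#next` key can be used, here is a small sketch that follows the chain from a root block downwards. The `$blocks` hash is a hand-built stand-in for the parser's result, and `block_path` is a hypothetical helper, not part of the code below:

```perl
use strict;
use warnings;

# Hand-built stand-in that follows the key layout described above.
my $blocks = {
    first => {
        content => 'outer',
        '#next' => 'second',
        second  => {
            content => 'middle',
            '#next' => 'third',
            third   => { content => 'inner' },
        },
    },
};

# block_path (hypothetical): follow '#next' from a root block and
# return the chain of block names from outermost to innermost.
sub block_path {
    my ($tree, $name) = @_;
    my @path;
    my $node = $tree;
    while ( defined $name && exists $node->{$name} ) {
        push @path, $name;
        $node = $node->{$name};
        $name = $node->{'#next'};   # undef once the innermost block is reached
    }
    return @path;
}

print join( ' -> ', block_path( $blocks, 'first' ) ), "\n";
```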
Here is the final, tested and working Perl code:
use Data::Dumper;
$/ = undef;
my $content = <DATA>;
my $blocks = parse_blocks($content);
print Dumper($blocks);
#print join "\n", keys( %{$blocks} ); # root block names
#print join "\n", keys( %{$blocks->{first}} ); # keys of block "first"
#print join "\n", keys( %{$blocks->{first}->{second}});
#print Dumper $blocks->{first};
#print Dumper $blocks->{first}->{content};
#print Dumper $blocks->{first}->{match};
# check if fourth block has sub block.
#print exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}, "\n";
# check if block has sub block, get it:
#if (exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}) {
# print $blocks->{first}->{second}->{third}->{fourth}->{ $blocks->{first}->{second}->{third}->{fourth}->{'#next'} }->{content}, "\n";
#}
exit;
#================================================
sub parse_blocks {
my ($content) = @_;
my $href = {};
# find outer blocks only
while ($content =~ /(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis) {
# parse each outer block nested blocks
parse_nest_blocks($href, $1);
}
return $href;
}
#================================================
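sub parse_nest_blocks {
my ($href, $core) = @_;
my ($k, $v);
while ( $core =~ /(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx )
{
# $2 (the block name) is set only when the block alternative matched
if (defined $2) {
$k = $2; $v = $3;
$href->{$k} = {};
$href->{$k}->{content} = $v;
$href->{$k}->{match} = $1;
#print "1:{{$k}}\n2:[[$v]]\n";
# recurse into the block content; the return value is the child block name, if any
my $ret = parse_nest_blocks($href->{$k}, $v);
if ($ret) {
$href->{$k}->{'#next'} = $ret;
}
# note: this returns at the first block found, so sibling blocks at the same level are not recorded
return $k;
}
}
# no nested block found
return;
}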
#================================================
__DATA__
some html content here top base
<!--block:first-->
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base
some html content here 6-8 top base
<!--block:six-->
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base
And the output of dumping the whole hash is:
$VAR1 = {
'first' => {
'second' => {
'third' => {
'match' => '<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->',
'content' => '
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
',
'fourth' => {
'fifth' => {
'match' => '<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->',
'content' => '
some html content here 5a
some html content here 5b
'
},
'match' => '<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->',
'content' => '
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
',
'#next' => 'fifth'
},
'#next' => 'fourth'
},
'match' => '<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->',
'content' => '
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
',
'#next' => 'third'
},
'match' => '<!--block:first-->
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->',
'content' => '
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
',
'#next' => 'second'
},
'six' => {
'match' => '<!--block:six-->
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
<!--endblock-->',
'content' => '
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
',
'seven' => {
'match' => '<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->',
'content' => '
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
',
'eight' => {
'match' => '<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->',
'content' => '
some html content here 8a
some html content here 8b
'
},
'#next' => 'eight'
},
'#next' => 'seven'
}
};