After seeing this question, I started thinking more deeply about capture groups.
For example, given the following (sample) input:
text (aaa (text) ccc)
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' )
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc)
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' )
text ( " aaa ( ' text ' ) ccc " )
I want to capture whatever stands in place of aaa, text (the one in the middle) and ccc, so as to get the desired result:
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
I have a solution that uses 3 regexes:
use strict;
use warnings;
while(<DATA>){
    chomp;
    m/
        .*?     #non greedy anything up to
        text    #the first "text"
        \s*     #optional spaces
        \(      #opening (
        (.*)    #content inside () greedy -> $1
        \)      #closing )
        \s*$
    /x;
    #processing only the captured content with removed outside ()
    #remove outside ' or " and extra spaces
    my $inside = $1;
    $inside =~ m/
        #at the beginning of "line"
        ^\s*    #optional spaces
        ["']?   #optional " or '
        \s*     #optional spaces
        (.*?)   #content - non greedy -> $1
        #at the end of "line"
        \s*     #optional spaces before the closing ' "
        ['"]?   #optional closing " or '
        \s*$    #optional spaces
    /x;
    $inside = $1;
    $inside =~ m/
        ^(\w+)  #any word at the start -> $1
        \s*     #optional spaces
        \(      #opening (
        \s*     #optional spaces
        ['"]?   #optional ' or "
        \s*     #spaces
        (.*?)   #the content inside ' " -> $2
        \s*     #any spaces
        ['"]?   #optional "'
        \s*     #sp
        \)      #closing )
        \s*     #spaces
        (\w+)$  #word at the end -> $3
    /x;
    print "=$1= =$2= =$3=\n";
}
__DATA__
text (aaa (text) ccc)
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' )
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc)
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' )
text ( " aaa ( ' text ' ) ccc " )
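Question:
Can this be done with a single m// match with capture groups (rather than with subsequent substitution regexes)?
P.S. I know that Text::Balanced exists, but this question is more about "what regexes can do".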
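Answer 0 (score: 2)
In general, you can't just combine regexes together. Sometimes you can, sometimes you can't, and when you can, the resulting regex usually ends up longer. For the ones you have above, you could use something like this: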
^\w+\s*\(\s*(?:(')|("))?\s*(\w+)\s*\(\s*((?(1)"|'))?\s*(\w+)\s*\4?\s*\)\s*(\w+)\s*(?(1)'|")?\s*\)$
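The above also ensures that the quotes are used correctly (for example, double quotes cannot be used inside double quotes). The required groups are in $3, $5 and $6. There is also an example on ideone.
I'll comment on only some of the parts: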
^\w+ # Beginning + function name
\s*
\(
\s*
(?:(')|("))? # Capture either single or double quote
\s*
(\w+)
\s*
\(
\s*
((?(1)"|'))? # If a single quote was captured, now match double, and vice versa. Capture
\s*
(\w+)
\s*
\4? # Use the 4th capture from above comment
\s*
\)
\s*
(\w+)
\s*
(?(1)'|")? # Use what was used in first quoting character
\s*
\)$
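As a quick way to try the combined regex, here is a minimal sketch that wraps it in a small script. The names $re, extract_words and @samples are made up for this illustration, and the pattern is the same one as above, only laid out with /x.

```perl
use strict;
use warnings;

# The combined regex from above, spread out with /x.
# $1 or $2 = outer quote, $3 = first word, $4 = inner quote,
# $5 = middle word, $6 = last word.
my $re = qr/
    ^\w+ \s* \( \s*
    (?:(')|("))? \s*
    (\w+) \s*
    \( \s*
    ((?(1)"|'))? \s*
    (\w+) \s*
    \4? \s*
    \) \s*
    (\w+) \s*
    (?(1)'|")? \s*
    \)$
/x;

# Hypothetical helper: returns the three words, or an empty list on no match.
sub extract_words {
    my ($line) = @_;
    return $line =~ $re ? ($3, $5, $6) : ();
}

my @samples = (
    q{text (aaa (text) ccc)},
    q{text ( ' aaa ( " text " ) ccc ' )},
    q{text ( " aaa ( ' text ' ) ccc " )},
);

for my $line (@samples) {
    my @w = extract_words($line);
    if (@w) {
        print "=$w[0]= =$w[1]= =$w[2]=\n";
    } else {
        print "no match: $line\n";
    }
}
```

Run as written, this should print =aaa= =text= =ccc= for each of the three sample lines. A line with double quotes nested inside double quotes should fall into the "no match" branch.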
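Whether it is more desirable to use one regex or several is up to the user. If they can do it in one and are not worried about maintaining it, then sure.
If they can do it in one and still explain it clearly, why not?
Keep in mind that the longer a regex gets, the more prone it is to errors and catastrophic backtracking, and the harder it is to understand.
A longer regex is not necessarily slower than a shorter one. Some tools can make matching faster: atomic groups, possessive quantifiers and negated character classes are a few of them.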
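To make those three tools a bit more concrete, here is a small sketch (not part of the original solution, and assuming Perl 5.10 or later for the possessive quantifier). It shows that possessive quantifiers and atomic groups refuse to give back what they have matched, which is what cuts down on backtracking. The variable names are chosen only for this example.

```perl
use strict;
use warnings;

my $s = 'aaa';

# Ordinary greedy quantifier: \w+ first takes "aaa", then gives back one
# character so that the trailing "a" can match.
my $greedy = $s =~ /^\w+a$/ ? 1 : 0;

# Possessive quantifier: \w++ takes "aaa" and never gives anything back,
# so the trailing "a" has nothing left to match.
my $possessive = $s =~ /^\w++a$/ ? 1 : 0;

# Atomic group: (?>...) behaves the same way as the possessive form here.
my $atomic = $s =~ /^(?>\w+)a$/ ? 1 : 0;

# Negated class: [^"]* cannot run past the next quote, so the engine has
# far less to backtrack over than with a greedy .*
my ($quoted) = 'say "one" and "two"' =~ /"([^"]*)"/;

print "greedy=$greedy possessive=$possessive atomic=$atomic quoted=$quoted\n";
```

This should print greedy=1 possessive=0 atomic=0 quoted=one. The two failing matches are the point: the engine gives up right away instead of retrying shorter matches.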
Answer 1 (score: 0)
How about:
use strict;
use warnings;
use feature 'say';   # say is not enabled by default
while(<DATA>){
    chomp;
    /text \((['" ]*)(\w+)\s*\((['" ]*)(\w+)\3\)\s*(\w+)\1\)/ ;
    say "=$2= =$4= =$5=";
}
__DATA__
text (aaa (text) ccc)
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' )
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc)
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' )
text ( " aaa ( ' text ' ) ccc " )
Output:
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=