Question

After seeing this question, I started thinking more deeply about capture groups.

For example, given the following (sample) input:

text (aaa (text) ccc) 
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' ) 
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc) 
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' ) 
text ( " aaa ( ' text ' ) ccc " )

I want to capture whatever appears in place of aaa, text (the one in the middle) and ccc, to get the desired result:

=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=

I have a solution that uses 3 regexes:

use strict;
use warnings;

while(<DATA>){
    chomp;
    m/
        .*?     #non greedy anything up to
        text    #the first "text"
        \s*     #optional spaces
        \(      #opening (
        (.*)    #content inside () greedy -> $1
        \)      #closing )
        \s*$
    /x;

    #processing only the captured content with removed outside ()
    #remove outside ' or " and extra spaces
    my $inside = $1;
    $inside =~ m/
                #at the beginning of "line"
        ^\s*    #optional spaces
        ["']?   #optional " or '
        \s*     #optional spaces

        (.*?)   #content - non greedy -> $1

                #at the end of "line"
        \s*     #optional spaces before the closing ' "
        ['"]?   #optional closing " or '
        \s*$    #optional spaces
    /x;

    $inside = $1;
    $inside =~ m/
        ^(\w+)  #any word at the start -> $1
        \s*     #optional spaces
        \(      #opening (
        \s*     #optional spaces
        ['"]?   #optional ' or "
        \s*     #spaces
        (.*?)   #the content inside ' " -> $2
        \s*     #any spaces
        ['"]?   #optional ' or "
        \s*     #optional spaces
        \)      #closing )
        \s*     #spaces
        (\w+)$  #word at the end -> $3
    /x;

    print "=$1= =$2= =$3=\n";
}
__DATA__
text (aaa (text) ccc) 
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' ) 
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc) 
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' ) 
text ( " aaa ( ' text ' ) ccc " )

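Questions:

Is it possible to combine all 3 regexes into one?
If yes, is that possible in general? That is, can ANY number of successive matching regexes with capture groups be merged into one regex? (I mean only m// matches with capture groups, not successive substitution regexes.)
If yes, when is it preferable to use one regex rather than several? When are several smaller regexes faster than one big one?

PS: I know that Text::Balanced exists, but this question is more about "what is possible with regexes".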

Answer 1

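In general, you cannot simply join regexes together. Sometimes you can, sometimes you can't, and the combined regex usually ends up longer. For the ones you have above, for example, you could use something like this: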

^\w+\s*\(\s*(?:(')|("))?\s*(\w+)\s*\(\s*((?(1)"|'))?\s*(\w+)\s*\4?\s*\)\s*(\w+)\s*(?(1)'|")?\s*\)$

Regex101 demo

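The above also makes sure that the right quotes are used (for instance, double quotes cannot be used inside double quotes). The wanted groups are in $3, $5 and $6. There is also an example on ideone.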

I'll comment on only some of the parts:

^\w+         # Beginning + function name
\s*
\(
\s*
(?:(')|("))? # Capture either single or double quote
\s*
(\w+)
\s*
\(
\s*
((?(1)"|'))? # If a single quote was captured, now match double, and vice versa. Capture
\s*
(\w+)
\s*
\4?          # Use the 4th capture from above comment
\s*
\)
\s*
(\w+)
\s*
(?(1)'|")?   # Use what was used in first quoting character
\s*
\)$
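
To try the breakdown above in Perl, here is a minimal sketch that wraps the same pattern in qr//x. The sub name extract and the sample strings are my own, not part of the answer. I am also assuming the input has no trailing whitespace, because the pattern ends in \)$ with no \s* after the closing parenthesis.

```perl
use strict;
use warnings;

# The pattern from the answer, laid out with the x flag.
my $re = qr/
    ^\w+ \s* \( \s*
    (?:(')|("))?      # groups 1 and 2: optional outer quote
    \s* (\w+) \s*     # group 3: first word
    \( \s*
    ((?(1)"|'))?      # group 4: optional inner quote of the other kind
    \s* (\w+) \s*     # group 5: middle word
    \4? \s* \) \s*
    (\w+) \s*         # group 6: last word
    (?(1)'|")? \s* \)$
/x;

# Hypothetical helper: returns the three words, or an empty list on failure.
sub extract {
    my ($line) = @_;
    return $line =~ $re ? ($3, $5, $6) : ();
}

my @lines = (
    q{text (aaa (text) ccc)},
    q{text ( ' aaa (text) ccc ' )},
    q{text ( " aaa ( ' text ' ) ccc " )},
    q{text ( " aaa ( " text " ) ccc " )},   # same quote kind nested
);

for my $line (@lines) {
    my @got = extract($line);
    if (@got) {
        print "=$got[0]= =$got[1]= =$got[2]=\n";
    } else {
        print "no match: $line\n";
    }
}
```

Under those assumptions the first three strings should give the three words, and the last one should be rejected because the inner quote is the same kind as the outer one.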

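Whether it is preferable to use one regex or several is up to the user. If they can do it in one and are not worried about maintaining it, then sure.

If they can combine everything into one regex and still explain it clearly, why not?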

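Keep in mind that the longer a regex gets, the more prone it is to mistakes and catastrophic backtracking, and the harder it is to understand.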
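
One way to keep a long pattern readable is to build it from small qr// pieces. This is only a sketch: the variable names are made up for illustration, and it is deliberately looser than the single regex above because it accepts any optional quote and does not check that quotes pair up, which is closer to the three-step version in the question.

```perl
use strict;
use warnings;

# Hypothetical building blocks, named only for this example.
my $word  = qr/(\w+)/;                 # one captured word
my $quote = qr/['"]?/;                 # optional quote, not checked for pairing
my $open  = qr/\(\s*$quote\s*/;        # opening paren plus optional quote
my $close = qr/\s*$quote\s*\)/;        # optional quote plus closing paren

# Capture groups inside interpolated qr objects are numbered left to right.
my $combined = qr/^\w+\s*$open$word\s*$open$word$close\s*$word$close\s*\z/;

sub extract {
    my ($line) = @_;
    return $line =~ $combined ? ($1, $2, $3) : ();
}

print join(' ', map { "=$_=" } extract(q{text ( ' aaa ( " text " ) ccc ' ) })), "\n";
```

Each piece can be tested on its own, which helps when the combined pattern stops matching.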

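A longer regex is not necessarily slower than several smaller ones. There are tools that make patterns run faster; atomic groups, possessive quantifiers and negated character classes are some of them.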
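
To make those three tools concrete, here is a small sketch of what each one does. It shows behaviour only, not timings, and the sample strings are my own.

```perl
use strict;
use warnings;

my $s = 'aaaa';

# Greedy: \w+ takes all four letters, then gives one back so the final a can match.
my $greedy = $s =~ /^\w+a\z/ ? 1 : 0;

# Possessive: \w++ never gives characters back, so nothing is left for the final a.
my $possessive = $s =~ /^\w++a\z/ ? 1 : 0;

# Atomic group: once (?>...) has matched, the engine will not backtrack into it.
my $atomic = $s =~ /^(?>\w+)a\z/ ? 1 : 0;

# Negated class: [^"]* cannot run past the closing quote, so there is
# very little for the engine to backtrack over.
my ($quoted) = 'say "hello" and "bye"' =~ /"([^"]*)"/;

print "greedy=$greedy possessive=$possessive atomic=$atomic quoted=$quoted\n";
```

The possessive and atomic versions fail here on purpose: refusing to backtrack is what makes them fast, so they only fit where giving characters back is never needed.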

Answer 2

How about:

use feature 'say';

while(<DATA>){
    chomp;
    /text \((['" ]*)(\w+)\s*\((['" ]*)(\w+)\3\)\s*(\w+)\1\)/ ;
    say "=$2= =$4= =$5=";
}
__DATA__
text (aaa (text) ccc) 
text ( aaa (text) ccc )
text ( ' aaa (text) ccc ' ) 
text ( " aaa (text) ccc " )
text (aaa ( ' text ' ) ccc) 
text ( aaa ( ' text ' ) ccc )
text ( ' aaa ( " text " ) ccc ' ) 
text ( " aaa ( ' text ' ) ccc " )

Output:

=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
=aaa= =text= =ccc=
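
For readability, the same pattern can be spelled out with /x and named captures. This is a sketch: the group names and the sub extract are my own choice. Since the back-references repeat the captured run of spaces and quotes exactly, the padding before each closing parenthesis has to mirror the padding after the opening one.

```perl
use strict;
use warnings;

my $re = qr/
    text [ ] \(
    (?<outer> ['" ]* )    # padding and quote after the first opening paren
    (?<first> \w+ ) \s*
    \(
    (?<inner> ['" ]* )    # padding and quote after the second opening paren
    (?<middle> \w+ )
    \k<inner> \)          # the same inner run must repeat here
    \s*
    (?<last> \w+ )
    \k<outer> \)          # the same outer run must repeat here
/x;

# Hypothetical helper: returns the three words, or an empty list on failure.
sub extract {
    my ($line) = @_;
    return $line =~ $re ? ($+{first}, $+{middle}, $+{last}) : ();
}

for my $line (q{text ( ' aaa ( " text " ) ccc ' )}, q{text ( aaa (text) ccc)}) {
    my @got = extract($line);
    print @got ? "=$got[0]= =$got[1]= =$got[2]=\n" : "no match: $line\n";
}
```

The second string has a space after the first parenthesis but none before the last one, so I would expect it not to match.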
