Question

Why does replacing \s* (or even \s\s*) with \s+ result in such a speedup for this input?

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
    '/\s+\n/' => sub { $x =~ /\s+\n/ },
});

Link to online version

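I noticed a slow regex, s/\s*\n\s*/\n/g, in my code. Given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end, the regex hung and never finished.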

On a hunch I replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g; and all was well.
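As a quick sanity check on a small input, the two-step version gives the same result as the original single substitution. This is only a minimal sketch: the sample string and the variable names are made up for illustration, and it says nothing about speed.

```perl
use strict;
use warnings;

# Hypothetical sample: whitespace on both sides of two newlines.
my $text = "a  \n  b  \n  c";

# Original single-pass substitution.
(my $single = $text) =~ s/\s*\n\s*/\n/g;

# Replacement: strip whitespace before each newline, then after it.
my $split = $text;
$split =~ s/\s+\n/\n/g;
$split =~ s/\n\s+/\n/g;

print $single eq $split ? "same result\n" : "different result\n";
```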

But why is it so much faster? After using re Debug => "EXECUTE" I noticed that the \s+ version is somehow optimized to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)
   0 <> <       _%n>         |  1:STAR(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   1 < > <      _%n>         |  1:STAR(3)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   2 <  > <     _%n>         |  1:STAR(3)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   3 <   > <    _%n>         |  1:STAR(3)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   4 <    > <   _%n>         |  1:STAR(3)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   5 <     > <  _%n>         |  1:STAR(3)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   6 <      > < _%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 <       _> <%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 <       _> <%n>         |  3:  EXACT <\n>(5)
   9 <       _%n> <>         |  5:  END(0)
Match successful!

Matching REx "\s+\n" against "       _%n"
Matching stclass SPACE against "       _" (8 bytes)
   0 <> <       _%n>         |  1:PLUS(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
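Traces like the two above can be reproduced with something along these lines. It is a sketch that assumes the same 7-space input shown in the traces; the variable names are illustrative, and the exact trace text varies between Perl versions.

```perl
use strict;
use warnings;
use re Debug => 'EXECUTE';   # the execution trace is written to STDERR

my $x = (" " x 7) . "_\n";   # the input shown in the traces

my $star_matched = $x =~ /\s*\n/ ? 1 : 0;   # matches: \s* may be empty
my $plus_matched = $x =~ /\s+\n/ ? 1 : 0;   # fails: "_" precedes the newline

print "star: $star_matched, plus: $plus_matched\n";
```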

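I know that Perl 5.10+ will immediately fail the regex (without running it) if no newline is present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all the cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/ against a string of spaces would take exponential time). Can anyone offer insight into why the \s+ version is so much faster?

Also note that \s*? does not offer any speedup.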

Answer 1

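First, even if the resulting regexes do not keep the same meaning, let's reduce them to \s*0 and \s+0 and use (" " x 4) . "_0" as the input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code: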

$x = (" " x 4) . "_0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

Digging a little with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl seems to be optimized for failure. It first looks for constant strings (which costs only O(N)). Here it looks for 0: Found floating substr "0" at offset 5...
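As an analogy only, the constant-string pre-check can be pictured as a plain index() call. The helper name literal_offset is hypothetical, and this is not how Perl implements the check internally.

```perl
use strict;
use warnings;

# Analogy only: a literal that must appear in any match can be located
# with a plain linear scan before the regex proper is run.
sub literal_offset {
    my ($str, $literal) = @_;
    return index($str, $literal);   # -1 when the literal is absent
}

my $found  = literal_offset((" " x 4) . "_0", "0");
my $absent = literal_offset((" " x 4) . "_",  "0");
print "found at $found, absent gives $absent\n";
```

When the literal is absent, no match is possible and the pattern never has to be run.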

Then it starts with the variable part of the regex, \s* and \s+ respectively, and checks it against the whole minimal string:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

After that it looks for the first position that meets the stclass requirement, here position 0.

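\s*0:
- starts at 0, finds 4 spaces, then fails;
- starts at 1, finds 3 spaces, then fails;
- starts at 2, finds 2 spaces, then fails;
- starts at 3, finds 1 space, then fails;
- starts at 5 (offset 4, the _, is not in the start class and is skipped), finds 0 spaces, and does not fail;
- finds the exact 0.
\s+0:
- starts at 0, finds 4 spaces, then fails. Since the minimum number of spaces is not matched, the regex fails instantly.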
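The list above hints at why the cost differs so much on long inputs. The following is a toy count, not Perl's real engine: it assumes each attempt rescans the run of spaces that remains, and the function name and its flag are hypothetical.

```perl
use strict;
use warnings;

# Toy cost model: $n spaces followed by a non-space character.
# Each attempt starting at offset $start scans the $n - $start spaces left.
# With $first_only set, only the attempt at offset 0 is made, as in the
# \s+0 trace; otherwise every offset inside the run is tried, as for \s*0.
sub spaces_scanned {
    my ($n, $first_only) = @_;
    my $total = 0;
    for my $start (0 .. $n - 1) {
        $total += $n - $start;
        last if $first_only;
    }
    return $total;
}

printf "n=%s: every start %s, first only %s\n",
    $_, spaces_scanned($_, 0), spaces_scanned($_, 1) for 4, 100_000;
```

Under these assumptions the first count grows as n(n+1)/2 while the second grows linearly in n.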

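If you want to have fun with Perl's regex optimizations, consider the regexes / *\n/ and / * \n/. At first glance they look alike and seem to mean the same thing (the second actually requires at least one space before the newline)... but if you run them against (" " x 40000) . "_\n", the first one will check all possibilities while the second one will look for " \n" and fail immediately.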

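In a vanilla, non-optimized regex engine, both regexes can cause catastrophic backtracking, since they have to retry the pattern at each position as the engine bumps along. In the example above, however, the second one does not blow up in Perl because it has been optimized to find floating substr "0%n"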

You can see another example on Jeff Atwood's blog.

Note also that the issue is not specific to \s: it affects any pattern where xx* is used instead of x+. See example with 0s and also regex explosive quantifiers

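With such a short example the behavior is easy to find, but once you start playing with complicated patterns it is far from easy to spot, for example: Regular expression hangs program (100% CPU usage)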

Answer 2

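When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.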

Yves Orton explains this optimization nicely here:

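The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it tries only the first valid start position in a sequence.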

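Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.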

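/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence doesn't imply a failure later in the same sequence.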

You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'
   ...
   0 <> <123 456 78>         |  1:PLUS(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:PLUS(3)
      ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:PLUS(3)
         ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'
   ...
   0 <> <123 456 78>         |  1:STAR(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   1 <1> <23 456 789>        |  1:STAR(3)
      ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   2 <12> <3 456 789 >       |  1:STAR(3)
       ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:STAR(3)
        ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   5 <123 4> <56 789 x>      |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   6 <23 45> <6 789 x>       |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:STAR(3)
           ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   9 <23 456 7> <89 x>       |  1:STAR(3)
             ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
  10 <23 456 78> <9 x>       |  1:STAR(3)
              ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
  12 <23 456 789 > <x>       |  1:STAR(3)
               ^^
                                  POSIXD[\d] can match 0 times out of 2147483647...
  12 <23 456 789 > <x>       |  3:  EXACT <x>(5)
  13 <23 456 789 x> <>       |  5:  END(0)
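The start offsets in the two traces above can be reproduced with a toy model of the two start-class modes. This is a sketch and not the code in regexec.c: the function start_offsets and its arguments are hypothetical, and the start classes \d (for /\d+x/) and [\dx] (for /\d*x/) are assumptions, since the traces elide the stclass lines.

```perl
use strict;
use warnings;

# Toy model: list the offsets at which a match attempt would begin.
# $class is the assumed start class. With $flip_flop set, only the first
# offset of each run of class characters is attempted.
sub start_offsets {
    my ($str, $class, $flip_flop) = @_;
    my @starts;
    my $prev_in_class = 0;
    for my $i (0 .. length($str) - 1) {
        my $in_class = substr($str, $i, 1) =~ $class ? 1 : 0;
        push @starts, $i if $in_class && !($flip_flop && $prev_in_class);
        $prev_in_class = $in_class;
    }
    return @starts;
}

my $input = "123 456 789 x";
my @plus  = start_offsets($input, qr/\d/,     1);   # models /\d+x/
my @star  = start_offsets($input, qr/[\dx]/,  0);   # models /\d*x/
print "plus: @plus\n";
print "star: @star\n";
```

The model only enumerates attempt offsets; it does not decide whether an attempt succeeds.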

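Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.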
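The two logical equivalences can be spot-checked by comparing where each pair of patterns matches. This is a minimal sketch over a handful of made-up sample strings; the helper span is hypothetical, and agreement on samples is a check, not a proof.

```perl
use strict;
use warnings;

# Return "start-end" of the first match, or "none" if there is no match.
sub span {
    my ($str, $re) = @_;
    return $str =~ $re ? join("-", $-[0], $+[0]) : "none";
}

my @samples = ("a b", "a  b", "ab", "   ", "x \t\n y");
my $all_agree = 1;
for my $s (@samples) {
    $all_agree = 0 if span($s, qr/\s\s*/) ne span($s, qr/\s+/);
    $all_agree = 0 if span($s, qr/\s\s+/) ne span($s, qr/\s{2,}/);
}
print $all_agree ? "patterns agree on all samples\n" : "patterns disagree\n";
```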

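In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it is compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.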

Answer 3

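\s+\n requires that the character preceding the \n is a SPACE.

According to use re qw(debug), the compilation establishes that the pattern needs a straight run of a known number of spaces up to the substring \n, which is looked for in the input first. It then checks the fixed-length, space-only substring against the remaining part of the input, failing when it reaches _. That is a single possibility to check, regardless of how many spaces the input has. (When there are more _\n, each is found to fail just as directly, per the debug output.)

Seen this way, it is an optimization you would almost expect: it exploits a rather specific search pattern and gets lucky with this input. It only stands out in comparison with other engines, which clearly don't do this kind of analysis.

With \s*\n this is not the case. Once the \n is found and the previous character is not a space, the search has not failed, since \s* allows nothing (zero characters). There are no fixed-length substrings either, and it is in the backtracking game.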
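The difference described above shows up in where each pattern matches on a short input. This is a minimal sketch with an assumed 3-space input and illustrative variable names.

```perl
use strict;
use warnings;

my $x = (" " x 3) . "_\n";   # "_" at offset 3, newline at offset 4

# \s* may match zero characters, so the match is just the newline itself.
my $star_at  = $x =~ /\s*\n/ ? $-[0] : -1;
my $star_len = $x =~ /\s*\n/ ? $+[0] - $-[0] : 0;

# \s+ needs a whitespace character right before the newline; "_" is not one.
my $plus_at  = $x =~ /\s+\n/ ? $-[0] : -1;

print "star: offset $star_at, length $star_len; plus: offset $plus_at\n";
```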

Answer 4

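I am not sure of the internals of the regular expression engine, but it looks like it does not recognize that \s+ is in some way the same as \s\s*: in the second one it matches a space and then tries to match ever-growing numbers of spaces, while in the first it immediately concludes that there is no match.

The output using use re qw(debug); clearly shows this, using a much shorter string: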

test_re.pl

#!/usr/bin/env perl
use re qw(debug);

$x=(" " x 10) . "_\n";
print '-'x50 . "\n";
$x =~ /\s+\n/;
print '-'x50 . "\n";
$x =~ /\s\s*\n/;
print '-'x50 . "\n";

output

Compiling REx "\s+\n"
Final program:
    1: PLUS (3)
    2:   SPACE (0)
    3: EXACT <\n> (5)
    5: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE plus minlen 2
Compiling REx "\s\s*\n"
Final program:
    1: SPACE (2)
    2: STAR (4)
    3:   SPACE (0)
    4: EXACT <\n> (6)
    6: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE minlen 2
--------------------------------------------------
Guessing start of match in sv for REx "\s+\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:PLUS(3)
                                  SPACE can match 10 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Guessing start of match in sv for REx "\s\s*\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s\s*\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:SPACE(2)
   1 < > <         _>        |  2:STAR(4)
                                  SPACE can match 9 times out of 2147483647...
                                  failed...
   1 < > <         _>        |  1:SPACE(2)
   2 <  > <        _>        |  2:STAR(4)
                                  SPACE can match 8 times out of 2147483647...
                                  failed...
   2 <  > <        _>        |  1:SPACE(2)
   3 <   > <       _%n>      |  2:STAR(4)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   3 <   > <       _%n>      |  1:SPACE(2)
   4 <    > <      _%n>      |  2:STAR(4)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   4 <    > <      _%n>      |  1:SPACE(2)
   5 <     > <     _%n>      |  2:STAR(4)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   5 <     > <     _%n>      |  1:SPACE(2)
   6 <      > <    _%n>      |  2:STAR(4)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   6 <      > <    _%n>      |  1:SPACE(2)
   7 <       > <   _%n>      |  2:STAR(4)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   7 <       > <   _%n>      |  1:SPACE(2)
   8 <        > <  _%n>      |  2:STAR(4)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   8 <        > <  _%n>      |  1:SPACE(2)
   9 <         > < _%n>      |  2:STAR(4)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <         > < _%n>      |  1:SPACE(2)
  10 <          > <_%n>      |  2:STAR(4)
                                  SPACE can match 0 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Freeing REx: "\s+\n"
Freeing REx: "\s\s*\n"

Why is `\s+` so much faster than `\s\s*` in this Perl regex?

4 answers: