Question

Why doesn't the first print statement output what I expect:

first = This is a test string, sec = This is a test string

Since both * and + are greedy, why doesn't the inner *, i.e. the one inside the "((" of the first match, consume the entire string?

use strict;
use warnings;

my $string = "This is a test string";
$string =~ /((.*)*)/; 
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec ="

$string =~ /((.+)*)/;
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec = This is a test string"

Answer 1

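In the first regex, .* matches twice. The first time it matches the entire string. The second time it matches the empty string at the end of the string, because .* is allowed to match the empty string when nothing else is left to match.

This doesn't happen with the other regex, because .+ cannot match the empty string.

Edit: As for what ends up where: $2 will contain whatever was matched the last time .* / .+ was applied. $1 will contain whatever was matched by (.*)* / (.+)*, i.e. the entire string.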
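
One way to check this claim without reading debug output is to look at the offsets Perl records for each capture group. The following is only a minimal sketch; the variable names @star_span and @plus_span are made up for illustration. It relies on the built-in @- and @+ arrays, which hold the start and end offsets of each group after a successful match.

```perl
use strict;
use warnings;

my $string = "This is a test string";

# $-[2] and $+[2] are the start and end offsets of capture group 2.
$string =~ /((.*)*)/;
my @star_span = ($-[2], $+[2]);   # where $2 sits for the (.*)* pattern
print "(.*)*: \$2 spans $star_span[0]..$star_span[1]\n";

$string =~ /((.+)*)/;
my @plus_span = ($-[2], $+[2]);   # where $2 sits for the (.+)* pattern
print "(.+)*: \$2 spans $plus_span[0]..$plus_span[1]\n";
```

If the explanation above is right, the first line should report an empty span at the very end of the string (21..21) and the second should report the whole string (0..21).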

Answer 2

Running with "use re 'debug';" produces:

Compiling REx "((.*)*)"
Final program:
   1: OPEN1 (3)
   3:   CURLYX[0] {0,32767} (12)
   5:     OPEN2 (7)
   7:       STAR (9) # <====
   8:         REG_ANY (0)
   9:     CLOSE2 (11)
  11:   WHILEM[1/1] (0)
  12:   NOTHING (13)
  13: CLOSE1 (15)
  15: END (0)
minlen 0

Matching REx "((.*)*)" against "This is a test string"
   0 <> <This is a >         |  1:OPEN1(3)
   0 <> <This is a >         |  3:CURLYX[0] {0,32767}(12)
   0 <> <This is a >         | 11:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   0 <> <This is a >         |  5:    OPEN2(7)
   0 <> <This is a >         |  7:    STAR(9) # <====
                                      REG_ANY can match 21 times out of 2147483647...
  21 < test string> <>       |  9:      CLOSE2(11)
  21 < test string> <>       | 11:      WHILEM[1/1](0)
                                        whilem: matched 1 out of 0..32767
  21 < test string> <>       |  5:        OPEN2(7)
  21 < test string> <>       |  7:        STAR(9) # <====

  # This is where the outputs really start to diverge
  # --------------------------------------------------------------------------------------------
                                          REG_ANY can match 0 times out of 2147483647...
  21 < test string> <>       |  9:          CLOSE2(11) # <==== Succeeded
  21 < test string> <>       | 11:          WHILEM[1/1](0)
                                            whilem: matched 2 out of 0..32767
                                            whilem: empty match detected, trying continuation...
  # --------------------------------------------------------------------------------------------

  21 < test string> <>       | 12:            NOTHING(13)
  21 < test string> <>       | 13:            CLOSE1(15)
  21 < test string> <>       | 15:            END(0)
Match successful!

Compiling REx "((.+)*)"
Final program:
   1: OPEN1 (3)
   3:   CURLYX[0] {0,32767} (12)
   5:     OPEN2 (7)
   7:       PLUS (9) # <====
   8:         REG_ANY (0)
   9:     CLOSE2 (11)
  11:   WHILEM[1/1] (0)
  12:   NOTHING (13)
  13: CLOSE1 (15)
  15: END (0)
minlen 0

Matching REx "((.+)*)" against "This is a test string"
   0 <> <This is a >         |  1:OPEN1(3)
   0 <> <This is a >         |  3:CURLYX[0] {0,32767}(12)
   0 <> <This is a >         | 11:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   0 <> <This is a >         |  5:    OPEN2(7)
   0 <> <This is a >         |  7:    PLUS(9) # <====
                                      REG_ANY can match 21 times out of 2147483647...
  21 < test string> <>       |  9:      CLOSE2(11)
  21 < test string> <>       | 11:      WHILEM[1/1](0)
                                        whilem: matched 1 out of 0..32767
  21 < test string> <>       |  5:        OPEN2(7)
  21 < test string> <>       |  7:        PLUS(9) # <====

  # This is where the outputs really start to diverge
  # ------------------------------------------------------------------------------------
                                          REG_ANY can match 0 times out of 2147483647...
                                          failed... # <==== Failed
                                        whilem: failed, trying continuation...
  # ------------------------------------------------------------------------------------

  21 < test string> <>       | 12:        NOTHING(13)
  21 < test string> <>       | 13:        CLOSE1(15)
  21 < test string> <>       | 15:        END(0)
Match successful!

Answer 3

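The problem with the first regex is a combination of two facts: ()* saves only the last match, and .* can match the empty string (i.e. nothing). So, given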

"aaab" =~ /(.)*/;

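$1 will be "b". If you combine that behavior with the fact that .* matches the empty string, you can see that the inner capture matches twice: "This is a test string" and "". Since the empty string came last, it is what gets saved to $2. $1 is the whole capture, so it is equivalent to "This is a test string" . "". The second case works as you expect because .+ cannot match the empty string.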
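
The "only the last match is saved" behavior is easy to see in isolation. This is a small sketch rather than anything from the original answer; the names $last_only and @each_char are illustrative. It contrasts a quantified group with a list-context /g match, which returns every iteration.

```perl
use strict;
use warnings;

my $input = "aaab";

# A quantified group keeps only the text from its final iteration.
$input =~ /(.)*/;
my $last_only = $1;

# A list-context /g match returns the capture from every iteration.
my @each_char = $input =~ /(.)/g;

print "last iteration: $last_only\n";
print "all iterations: @each_char\n";
```

The group in /(.)*/ runs four times, but $1 only remembers the fourth run, while the /g form hands back all four characters.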

Answer 4

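I don't have an answer, but I do have a different way of framing the problem, using simpler and perhaps more realistic regular expressions.

The first two examples behave exactly as I expect: .* consumes the entire string, and the regex returns a list with only one element. But the third regex returns a list with 2 elements.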

use strict;
use warnings;
use Data::Dumper;

$_ = "foo";
print Dumper( [ /^(.*)/g ] ); # ('foo')     As expected.
print Dumper( [ /.(.*)/g ] ); # ('oo')      As expected.
print Dumper( [ /(.*)/g  ] ); # ('foo', '') Why?

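So far, many of the answers emphasize that .* will match anything. While that is true, it doesn't get to the heart of the matter, which is this: why is the regex engine still hunting after .* has consumed the entire string? In other circumstances (such as the first two examples), .* does not throw in an extra empty string for good measure.

Update after the helpful comments from Chas. Owens. The first evaluation of any of the three examples results in .* matching the entire string. If we could intervene and call pos() at that moment, the engine would indeed be at the end of the string (at least as we perceive the string; see the comments from Chas. for more insight on that). However, the /g option tells Perl to try to match the entire regex again. For examples #1 and #2 that second attempt fails, and the failure causes the engine to stop hunting. With regex #3, however, the engine gets another match: the empty string. Then the /g option tells the engine to try the entire pattern yet again. Now there really is nothing left to match, neither regular characters nor the trailing empty string, so the process stops.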
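
The sequence of attempts described above can be observed by running the third pattern in scalar context and recording pos() after each success. This is a hedged sketch, not part of the original answer; $text and @attempts are names invented for the illustration.

```perl
use strict;
use warnings;

my $text = "foo";
my @attempts;

# In scalar context, /g resumes from pos($text) on each pass of the loop
# and the loop ends on the first failed attempt.
while ($text =~ /(.*)/g) {
    push @attempts, [ $1, pos($text) ];
}

printf "matched '%s', pos = %d\n", @$_ for @attempts;
```

The loop should run twice: once for 'foo' and once for the empty string, both ending at position 3. The third attempt fails because Perl does not allow a second zero-length match at the position where the previous zero-length match ended.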

Why does .* consume the entire string in this Perl regex?
