Why doesn't .* consume the entire string in this Perl regex?

Asked: 2009-08-24 17:12:12

Tags: perl regex

Why doesn't the first print statement output what I expect:

first = This is a test string, sec = This is a test string 

Since both * and + are greedy, why doesn't the inner *, i.e. the one inside the "((", consume the entire string in the first match?

use strict;
use warnings;

my $string = "This is a test string";
$string =~ /((.*)*)/; 
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec ="

$string =~ /((.+)*)/;
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec = This is a test string"

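4 answers:

Answer 0 (score: 17)

In the first regex, .* is matched twice. The first time it matches the whole string. The second time it matches the empty string at the end, because .* can match the empty string when nothing else is left to match.

This does not happen with the other regex, because .+ cannot match the empty string.

Edit: As for what goes where: $2 will contain whatever was matched the last time .* / .+ was applied. $1 will contain whatever was matched by (.*)* / (.+)*, i.e. the whole string.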
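
One way to check this explanation is to look at where each capture group starts and ends, using Perl's built-in @- and @+ offset arrays. This is only a minimal sketch; report_spans is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

my $string = "This is a test string";

# Return "start-end" offsets for capture groups 1 and 2 of a pattern.
# (report_spans is a hypothetical helper used only for this illustration.)
sub report_spans {
    my ($re) = @_;
    return unless $string =~ $re;
    return (
        join('-', $-[1], $+[1]),   # span of the outer group
        join('-', $-[2], $+[2]),   # span of the inner group's last iteration
    );
}

my @star = report_spans(qr/((.*)*)/);
my @plus = report_spans(qr/((.+)*)/);

print "star: group 1 = $star[0], group 2 = $star[1]\n";
print "plus: group 1 = $plus[0], group 2 = $plus[1]\n";
```

With the star version, group 2 should come out as 21-21, a zero-length span at the end of the 21-character string, while group 1 is 0-21. With the plus version both groups should be 0-21.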

Answer 1 (score: 14)

Running with "use re 'debug'" produces:

Compiling REx "((.*)*)"
Final program:
   1: OPEN1 (3)
   3:   CURLYX[0] {0,32767} (12)
   5:     OPEN2 (7)
   7:       STAR (9) # <====
   8:         REG_ANY (0)
   9:     CLOSE2 (11)
  11:   WHILEM[1/1] (0)
  12:   NOTHING (13)
  13: CLOSE1 (15)
  15: END (0)
minlen 0 

Matching REx "((.*)*)" against "This is a test string"
   0 <> <This is a >         |  1:OPEN1(3)
   0 <> <This is a >         |  3:CURLYX[0] {0,32767}(12)
   0 <> <This is a >         | 11:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   0 <> <This is a >         |  5:    OPEN2(7)
   0 <> <This is a >         |  7:    STAR(9) # <====
                                      REG_ANY can match 21 times out of 2147483647...
  21 < test string> <>       |  9:      CLOSE2(11)
  21 < test string> <>       | 11:      WHILEM[1/1](0)
                                        whilem: matched 1 out of 0..32767
  21 < test string> <>       |  5:        OPEN2(7)
  21 < test string> <>       |  7:        STAR(9) # <====

  # This is where the outputs really start to diverge
  # --------------------------------------------------------------------------------------------
                                          REG_ANY can match 0 times out of 2147483647...
  21 < test string> <>       |  9:          CLOSE2(11) # <==== Succeeded
  21 < test string> <>       | 11:          WHILEM[1/1](0)
                                            whilem: matched 2 out of 0..32767
                                            whilem: empty match detected, trying continuation...
  # --------------------------------------------------------------------------------------------

  21 < test string> <>       | 12:            NOTHING(13)
  21 < test string> <>       | 13:            CLOSE1(15)
  21 < test string> <>       | 15:            END(0)
Match successful!

Compiling REx "((.+)*)"
Final program:
   1: OPEN1 (3)
   3:   CURLYX[0] {0,32767} (12)
   5:     OPEN2 (7)
   7:       PLUS (9) # <====
   8:         REG_ANY (0)
   9:     CLOSE2 (11)
  11:   WHILEM[1/1] (0)
  12:   NOTHING (13)
  13: CLOSE1 (15)
  15: END (0)
minlen 0 

Matching REx "((.+)*)" against "This is a test string"
   0 <> <This is a >         |  1:OPEN1(3)
   0 <> <This is a >         |  3:CURLYX[0] {0,32767}(12)
   0 <> <This is a >         | 11:  WHILEM[1/1](0)
                                    whilem: matched 0 out of 0..32767
   0 <> <This is a >         |  5:    OPEN2(7)
   0 <> <This is a >         |  7:    PLUS(9) # <====
                                      REG_ANY can match 21 times out of 2147483647...
  21 < test string> <>       |  9:      CLOSE2(11)
  21 < test string> <>       | 11:      WHILEM[1/1](0)
                                        whilem: matched 1 out of 0..32767
  21 < test string> <>       |  5:        OPEN2(7)
  21 < test string> <>       |  7:        PLUS(9) # <====

  # This is where the outputs really start to diverge
  # ------------------------------------------------------------------------------------
                                          REG_ANY can match 0 times out of 2147483647...
                                          failed... # <==== Failed
                                        whilem: failed, trying continuation...
  # ------------------------------------------------------------------------------------

  21 < test string> <>       | 12:        NOTHING(13)
  21 < test string> <>       | 13:        CLOSE1(15)
  21 < test string> <>       | 15:        END(0)
Match successful!
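
A trace like the one above can probably be reproduced with a script along these lines. It is a minimal sketch: the pragma is lexically scoped, the trace is written to STDERR, and the exact layout of the output varies between Perl versions, so it may not match the listing line for line. The "# <====" markers in the listing are annotations and will not appear.

```perl
use strict;
use warnings;

my $string = "This is a test string";
my $matched;

{
    # Only regexes compiled inside this block emit the debug trace.
    use re 'debug';
    $matched = ($string =~ /((.*)*)/) ? 1 : 0;
}

print "matched = $matched\n";
```

Swapping the pattern for /((.+)*)/ gives the second trace for comparison.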

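Answer 2 (score: 3)

The problem with the first regex is the combination of two facts: ()* saves only the last match, and .* can match the empty string (i.e. nothing). So, given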

"aaab" =~ /(.)*/;

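$1 will be "b". If you combine that behavior with the fact that .* matches the empty string, you can see that the inner capture has two matches: "This is a test string" and "". Since the empty string comes last, it is what gets saved in $2. $1 is the whole capture, so it is equivalent to "This is a test string" . "". The second case works as expected because .+ does not match the empty string.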
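
Both facts can be seen in a small script. This is an illustrative sketch, and the variable names are made up for the example. It also shows that matching the group repeatedly with /g, rather than quantifying it, is one way to keep every iteration.

```perl
use strict;
use warnings;

# A quantified group keeps only its last iteration.
"aaab" =~ /(.)*/;
my $last = $1;

# Matching the group repeatedly with /g in list context keeps them all.
my @all = "aaab" =~ /(.)/g;

# In the question's first regex the last iteration is the empty match,
# so $2 is defined but has length zero.
"This is a test string" =~ /((.*)*)/;
my ($outer, $inner) = ($1, $2);

print "last = $last\n";
print "all  = @all\n";
printf "outer = '%s', inner defined = %d, inner length = %d\n",
    $outer, defined($inner) ? 1 : 0, length $inner;
```

Here $last is "b", @all holds all four characters, and $inner is a defined, empty string rather than undef.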

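Answer 3 (score: 3)

I don't have an answer, but I do have a different way of framing the question, using simpler and perhaps more realistic regular expressions.

The first two examples behave exactly as I expect: .* consumes the entire string, and the regex returns a list containing only one element. But the third regex returns a list with 2 elements.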

use strict;
use warnings;
use Data::Dumper;

$_ = "foo";
print Dumper( [ /^(.*)/g ] ); # ('foo')     As expected.
print Dumper( [ /.(.*)/g ] ); # ('oo')      As expected.
print Dumper( [ /(.*)/g  ] ); # ('foo', '') Why?

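Many of the answers so far have emphasized that .* will match anything. While that is true, it does not get at the heart of the question, which is this: why is the regex engine still hunting after .* has consumed the entire string? In other circumstances (such as the first two examples), .* does not throw in an extra empty string for good measure.

Update after the helpful comments from Chas. Owens. The first evaluation of any of the three examples results in .* matching the entire string. If we could intervene and call pos() at that moment, the engine would indeed be at the end of the string (at least as we perceive the string; see the comments from Chas. for more insight on this). However, the /g option tells Perl to try to match the entire regex again. For examples #1 and #2 this second attempt fails, and that failure causes the engine to stop hunting. With regex #3, however, the engine gets another match: the empty string. The /g option then tells the engine to try the whole pattern yet again. Now there really is nothing left to match, neither regular characters nor the trailing empty string, so the process stops.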
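
The sequence described in the update can be watched directly by running the third regex in scalar context and recording pos() after each successful pass. This is a minimal sketch, and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $str = "foo";
my @log;

# In scalar context, each /g pass resumes from pos($str).
while ($str =~ /(.*)/g) {
    push @log, sprintf("matched '%s', pos = %d", $1, pos($str));
}

print "$_\n" for @log;
print "total matches: ", scalar(@log), "\n";
```

The loop runs twice: first it matches 'foo' with pos at 3, then the empty string with pos still at 3. The third attempt fails, since Perl does not allow a second zero-length match at the same position, and the loop ends.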