Question

I have a regular expression:

(<select([^>]*>))(.*?)(</select\s*>)

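Because it uses a lazy repetition quantifier, on longer strings (more than 500 options) it backtracks over 100,000 times and fails. Please help me find a better regular expression that does not use lazy repetition quantifiers.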

Answer 1

<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...or, in human-readable form:

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag
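
As a minimal sketch, the commented form above can be used almost verbatim in a regex engine that has a free-spacing mode. The example assumes Python's `re` module with `re.VERBOSE`; the sample HTML and the variable names are hypothetical and not part of the original question.

```python
import re

# The unrolled-loop pattern, written in free-spacing (verbose) mode.
UNROLLED = re.compile(r"""
    <select[^>]*>      # start tag
    [^<]*              # anything except an opening bracket
    (?:                # if we find an opening bracket...
      <(?!/select>)    #   match it only if it does not begin the end tag
      [^<]*            #   then consume any further non-brackets
    )*                 # repeat as needed
    </select>          # end tag
""", re.VERBOSE)

# Hypothetical sample input (note the newline between the options).
html = '<p>Pick:</p><select name="n"><option>1</option>\n<option>2</option></select><p>done</p>'

match = UNROLLED.search(html)
print(match.group(0))
```

Because `[^<]` is a negated character class, it also matches newlines, so no "dot matches all" flag is needed here.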

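This is an example of the "unrolled loop" technique that Friedl develops in his book Mastering Regular Expressions. I ran a quick test in RegexBuddy using a pattern based on a reluctant quantifier: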

(?s)<select[^>]*>.*?</select>

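...it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select), making a match impossible, it needed only 800 steps to report failure.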
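
The comparison above can be approximated outside RegexBuddy. The sketch below assumes Python's `re` module, which does not report step counts, so it only checks that the two patterns agree on a match and on a failure. The 600-option input is made-up test data.

```python
import re

LAZY = re.compile(r"(?s)<select[^>]*>.*?</select>")
UNROLLED = re.compile(r"<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>")

# Hypothetical test data: one select element with 600 options.
options = "".join(f'<option value="{i}">Item {i}</option>\n' for i in range(600))
good = f'<select name="big">\n{options}</select>'
bad = good[:-1]  # drop the final ">" so that no match is possible

lazy_hit = LAZY.search(good).group(0)
unrolled_hit = UNROLLED.search(good).group(0)
lazy_miss = LAZY.search(bad)
unrolled_miss = UNROLLED.search(bad)

print(lazy_hit == unrolled_hit, lazy_miss, unrolled_miss)
```

To compare speed rather than correctness, the two `search` calls could be wrapped in `timeit`, though the numbers will not correspond to RegexBuddy's step counts.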

If your regex flavor supports possessive quantifiers, go ahead and use them as well:

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

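It takes about the same number of steps to find a match, but it can use much less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.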
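
Whether the possessive version is available depends on the regex flavor. As a hedged illustration, Python's `re` module accepts possessive quantifiers from version 3.11 on, so the sketch below selects the pattern by interpreter version and falls back to the plain unrolled form otherwise. The sample input and names are hypothetical.

```python
import re
import sys

UNROLLED_SRC = r"<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>"
POSSESSIVE_SRC = r"<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>"

# Possessive quantifiers need Python 3.11+; older interpreters
# fall back to the plain unrolled pattern, which matches the same text.
SELECT_RE = re.compile(POSSESSIVE_SRC if sys.version_info >= (3, 11) else UNROLLED_SRC)

# Hypothetical sample input.
page = '<form><select id="s"><option>a</option><option>b</option></select></form>'
found = SELECT_RE.search(page)
# Break the end tag so that no match is possible.
missing = SELECT_RE.search(page.replace("</select>", "</select"))

print(found.group(0))
print(missing)
```

Both variants return the same results; the possessive one simply keeps no backtracking positions for the parts it has already consumed.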

Answer 2

Unfortunately, this will not work; see Alan Moore's answer for a correct example!

(<select([^>]*>))(.*+)(</select\s*>)

From the Perl regexp manpage:

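By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.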

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

      'aaaa' =~ /a++a/

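will never match, as the "a++" will gobble up all the "a"s in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as: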

      /"(?:[^"\\]++|\\.)*+"/
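
The same "gives nothing back" behaviour explains why the `(.*+)` pattern at the top of this answer cannot work: `.*+` runs to the end of the input and never releases the characters that `</select\s*>` needs. The sketch below tries the three patterns in Python rather than Perl, assuming Python 3.11 or later (the first release whose `re` module accepts possessive quantifiers). The function name and sample strings are hypothetical.

```python
import re
import sys

def possessive_demo():
    """Return match results for three possessive patterns (needs Python 3.11+)."""
    # 'a++' takes all four a's and gives none back, so the trailing 'a' fails.
    never = re.search(r"a++a", "aaaa")
    # '.*+' swallows the rest of the line, including the end tag it should stop at.
    broken = re.search(r"(<select([^>]*>))(.*+)(</select\s*>)",
                       "<select><option>1</option></select>")
    # The manpage's double-quoted-string pattern, where possessive forms are safe.
    quoted = re.search(r'"(?:[^"\\]++|\\.)*+"', r'say "a \"quoted\" word" now')
    return never, broken, quoted

if sys.version_info >= (3, 11):
    never, broken, quoted = possessive_demo()
    print(never, broken, quoted.group(0))
else:
    print("possessive quantifiers need Python 3.11 or later")
```

Possessive quantifiers are safe in the quoted-string pattern because each alternative can only stop where the other alternative or the closing quote must begin, so nothing ever needs to be given back.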
