Question

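I have an expression like c.{0,2}?m and a string like "abcemtcmncefmf". Currently it matches three substrings: cem, cm and cefm (see here). But I would like to match only the smallest of these, in this case cm.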

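My problem is that I have no global-match support, only the first match, because I am using the MariaDB REGEXP_SUBSTR() function. My current solution is a stored procedure that I wrote to solve the problem, but in simple cases it is 10 times slower than a plain regex.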

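I also tried something like (cm|c.{0,1}?m|c.{0,2}?m), but it does not work: at the first position where any branch of the group matches, that branch wins, instead of each branch being tried in turn across the whole subject string.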

I know regular expressions (PCRE) have some black-magic features, but I have not found one that solves my problem.

Notes: I am already using the non-greedy (lazy) quantifier (.{0,2}?);
the question "Regular expression to find smallest possible match" is not my problem.

Answer 1

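Regular expressions can do a lot of things, some of which are, as you say, 'black magic'. But the core problem is that, at a fundamental level, a regular expression is about selecting and capturing text. Regexes do not compare or evaluate matches: they either match or they don't.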

You can see what the regular expression is doing by running it in debug mode. For this I will use perl, because there you can set use re 'debug';:

#!/usr/bin/env perl

use strict;
use warnings;

use re 'debug';

my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/;
print join "\n", @matches;

This prints what the regex engine is doing:

Compiling REx "(cm|c.m|c..m)"
Final program:
   1: OPEN1 (3)
   3:   TRIE-EXACT[c] (19)
        <cm> (19)
        <c> (9)
   9:     REG_ANY (10)
  10:     EXACT <m> (19)
        <c> (15)
  15:     REG_ANY (16)
  16:     REG_ANY (17)
  17:     EXACT <m> (19)
  19: CLOSE1 (21)
  21: END (0)
stclass AHOCORASICK-EXACT[c] minlen 1 
Matching REx "(cm|c.m|c..m)" against "abcemtcmncefmf"
Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes)
   0 <> <abcemtcmnc>         | Scanning for legal start char...
   2 <ab> <cemtcmncef>       | Charid:  1 CP:  63 State:    1, word=0 - legal
   3 <abc> <emtcmncefm>      | Charid:  0 CP:  65 State:    2, word=2 - fail
   3 <abc> <emtcmncefm>      | Fail transition to State:    1, word=0 - fail
Matches word #2 at position 2. Trying full pattern...
   2 <ab> <cemtcmncef>       |  1:OPEN1(3)
   2 <ab> <cemtcmncef>       |  3:TRIE-EXACT[c](19)
   2 <ab> <cemtcmncef>       |    State:    1 Accepted: N Charid:  1 CP:  63 After State:    2
   3 <abc> <emtcmncefm>      |    State:    2 Accepted: Y Charid:  0 CP:  65 After State:    0
                                  got 2 possible matches
                                  TRIE matched word #2, continuing
   3 <abc> <emtcmncefm>      |  9:  REG_ANY(10)
   4 <abce> <mtcmncefmf>     | 10:  EXACT <m>(19)
   5 <abcem> <tcmncefmf>     | 19:  CLOSE1(21)
   5 <abcem> <tcmncefmf>     | 21:  END(0)
Match successful!
Freeing REx: "(cm|c.m|c..m)"

Hopefully you can see what it is doing here?

Works from left to right.
Hits the first 'c'.
Checks whether 'cm' matches (fails).
Checks whether 'c.m' matches (succeeds).
Bails out here and returns the hit.
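The steps above can be checked with a small sketch. The helper name first_match is hypothetical and exists only for illustration; it reports the first match together with its start offset, which shows that the leftmost match wins even though the shorter 'cm' branch is listed first.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first match and the offset where it starts.
sub first_match {
    my ($subject) = @_;
    return unless $subject =~ m/(cm|c.m|c..m)/;
    return ( $1, $-[0] );    # $-[0] is the start offset of the whole match
}

my ( $match, $offset ) = first_match("abcemtcmncefmf");
print "matched '$match' at offset $offset\n";    # prints: matched 'cem' at offset 2

# The shorter candidate "cm" exists, but it starts later in the string.
print "'cm' first occurs at offset ", index( "abcemtcmncefmf", "cm" ), "\n";    # 6
```

The engine never reaches offset 6, because a branch already succeeded at offset 2.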

With the g flag enabled, it goes through this several times; I won't reproduce the output, but it is a lot longer.

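You can do a lot of clever tricks with PCRE, such as lookaround, lookahead and greedy/non-greedy matching. But fundamentally, what you are trying to do here is choose among multiple valid matches and pick the shortest, and a regex cannot do that.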

I would point out, though, that in the same perl, finding the shortest one is quite easy:

use List::Util qw/reduce/;
print  reduce { length( $a ) < length( $b ) ? $a : $b } @matches;
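The two lines above assume that @matches already holds every match, which needs the g flag on the earlier pattern. A self-contained sketch under that assumption might look like this (shortest_match is a hypothetical helper name):

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: collect every match with /g, then keep the shortest.
# Returns undef when nothing matches, because reduce on an empty list is undef.
sub shortest_match {
    my ($subject) = @_;
    my @matches = $subject =~ m/(cm|c.m|c..m)/g;    # ("cem", "cm", "cefm") for the sample
    return reduce { length($a) < length($b) ? $a : $b } @matches;
}

print shortest_match("abcemtcmncefmf"), "\n";    # prints: cm
```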

Answer 2

You can simply use an alternation in a branch reset group:

/^(?|.*(cm)|.*(c.m)|.*(c..m))/s

(the result is in group 1)

or like this:

/^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s

The first branch that succeeds wins.
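As a quick check, here is a sketch that runs both patterns in perl against the sample string. The two helper names are hypothetical. Note that because .* is greedy, each branch finds the last occurrence of its alternative rather than the first.

```perl
use strict;
use warnings;

# Branch reset: whichever branch succeeds, its capture lands in group 1.
sub shortest_via_branch_reset {
    my ($subject) = @_;
    my ($found) = $subject =~ /^(?|.*(cm)|.*(c.m)|.*(c..m))/s;
    return $found;
}

# \K variant: text matched before \K is dropped from the overall match ($&).
sub shortest_via_keep {
    my ($subject) = @_;
    return $subject =~ /^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s ? $& : undef;
}

print shortest_via_branch_reset("abcemtcmncefmf"), "\n";    # prints: cm
print shortest_via_keep("abcemtcmncefmf"),         "\n";    # prints: cm
```

The \K form returns the result as the whole match, so it should fit a function that only hands back the full match, as REGEXP_SUBSTR() does; that part is an assumption and is not tested here.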

Answer 3

Technically, it can be done.

my ($match) = /
   ^
   (?:(?! c[^m]{0,2}m ).)*+         # Skip past area with no matches.
   (?:
      (?:(?! c[^m]{0,1}m ).)*+      # Skip past area with no matches except the longest.
      (?:
         (?:(?! c[^m]{0,0}m ).)*+   # Skip past area with no matches except the 2 longest.
      )?
   )?
   ( c[^m]{0,2}m )
/xs;

[Note: Removing the possessive quantifier modifier (+) impacts performance.]

But finding all the matches and then picking the smallest one is usually much better.

use List::Util qw( reduce );
my ($match) = reduce { length($a) <= length($b) ? $a : $b } /c[^m]{0,2}m/g;
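One detail worth knowing: this snippet compares with <= while the one in Answer 1 uses <, so they differ when several matches tie for shortest. A minimal sketch with a made-up subject string ("cxm cym", not from the question) shows the difference:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Two matches of equal length: ("cxm", "cym").
my @candidates = "cxm cym" =~ /c[^m]{0,2}m/g;

# <= keeps $a on a tie, so the earliest shortest match survives.
my $first_wins = reduce { length($a) <= length($b) ? $a : $b } @candidates;

# <  picks $b on a tie, so the latest shortest match survives.
my $last_wins = reduce { length($a) < length($b) ? $a : $b } @candidates;

print "$first_wins $last_wins\n";    # prints: cxm cym
```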

Regular expression to match only the smallest match
