Question

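I have a set of keywords. Any keyword can contain a space character, e.g. ['one', 'one two']. I generate a regular expression from these keywords, such as /\b(?i:one|one\ two|three)\b/. A full example is below: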

keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)

The result of this code is

=> ["one", "one"]

How can I get a match for the second keyword, one two, and obtain this result?

=> ["one", "one two"]

Answer 1

The point is that \bone\b matches the one in one two, and since this branch appears before the one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
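
A minimal sketch of that eagerness, using two throwaway patterns that are not part of the question: on the same input, the order of the alternatives alone decides which branch is reported.

```ruby
# Alternatives are tried left to right; the first one that lets the
# overall pattern succeed is used, even if a later one would match more.
short_first = /\b(?:one|one two)\b/
long_first  = /\b(?:one two|one)\b/

text = 'one two'
short_match = text[short_first]  # "one" is listed first, so it wins
long_match  = text[long_first]   # "one two" is tried before "one"

p short_match
p long_match
```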

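Before building the regex, you need to sort the keyword array in descending order, so that a longer keyword comes before any shorter keyword that is its prefix. The pattern will then look like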

(?-mix:\b(?i:three|one\ two|one)\b)

This way, the longer one two comes before the shorter one and gets matched.

See the Ruby demo:

keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
p text.downcase.scan(re)
# => ["one", "one two"]

Answer 2

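Regular expressions are eager to match. Once they find a match, they do not try to find another, possibly longer one (with one important exception).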

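/\b(?i:one|one\ two|three)\b/ will never match one two, because it always matches one first. You need /\b(?i:one two|one|three)\b/ so that one two is tried first. Probably the simplest way to automate this is to sort the keywords longest first.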

keywords = ['one', 'one two', 'three']
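# Longest keywords first, so "one two" is tried before "one".
re = Regexp.union(keywords.sort { |a, b| b.length <=> a.length }).source
# Group the alternatives so that both \b anchors apply to every keyword.
re = /\b(?:#{re})\b/i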
text = 'Some word one and one two other word'
puts text.scan(re)

Note that I made the whole regex case-insensitive, which is easier to read than (?i:...), and that downcasing the string is then redundant.
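
The same idea can be wrapped in a small helper. This is only a sketch: the name keyword_regex is hypothetical and not part of the original answer, and it assumes that sorting by length is an acceptable way to break ties between overlapping keywords.

```ruby
# Hypothetical helper: builds one case-insensitive regex that prefers
# longer keywords over shorter ones.
def keyword_regex(keywords)
  # sort_by returns a new array, so the caller's array is left untouched.
  longest_first = keywords.sort_by { |keyword| -keyword.length }
  # The (?:...) group makes both \b anchors apply to every alternative.
  /\b(?:#{Regexp.union(longest_first).source})\b/i
end

keywords = ['one', 'one two', 'three']
matches = 'Some word One and one two other word'.scan(keyword_regex(keywords))
p matches
```

Because the /i flag covers the whole pattern, the text is scanned without downcasing it, and the matches keep their original case.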

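The exception is repetition, such as +, * and friends. By default they are greedy: .+ will match as many characters as possible. You can make a quantifier lazy, so that it matches as few characters as it can, by appending ?. On its own, .+? will match a single character.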
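
To see the two modes in isolation, here is a minimal sketch with a made-up input string (not from the original answer):

```ruby
text = 'aaa'
greedy = text[/a+/]   # takes as many characters as it can
lazy   = text[/a+?/]  # stops at the minimum, a single character

p greedy
p lazy
```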

"A foot of fools".match(/(.*foo)/);  # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/);  # matches "A foo"

Answer 3

I tried your example after moving the first element to the second position in the array, and it works (e.g. http://rubular.com/r/4F2Hc46wHT).

In fact, it looks like the first keyword "overlaps" the second.

This answer may not help if you cannot change the order of the keywords.
