Question

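I'm looking for a method, in either Ruby or JavaScript, that gives me all matches of a regex within a string, including overlapping ones.

Say I have str = "abcadc" and I want to find occurrences of a, followed by any number of characters, followed by c. The result I'm looking for is ["abc", "adc", "abcadc"]. Any ideas on how to accomplish this?

str.scan(/a.*c/) gives me ["abcadc"], and str.scan(/(?=(a.*c))/).flatten gives me ["abcadc", "adc"].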

Answer 1

In Ruby, you can get the expected result with:

str = "abcadc"
[/(a[^c]*c)/, /(a.*c)/].flat_map{ |pattern| str.scan(pattern) }.reduce(:+)
# => ["abc", "adc", "abcadc"]

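Whether this approach suits you depends heavily on what you really want to achieve.

I tried to put this into a single expression, but I couldn't make it work. I'd really like to know whether there is some theoretical reason this can't be done with a regular expression, or whether I just don't know enough about Ruby's regex engine, Oniguruma.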
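As far as I can tell, part of the difficulty is that a scan reports at most one match per starting position, so a single pattern cannot return both "abc" and "abcadc" from index 0. Here is a minimal sketch of that behaviour, pairing a lazy and a greedy lookahead. The variable names are mine, for illustration only, and this is not a general solution.

```ruby
str = "abcadc"

# One capture per start position: the lazy form stops at the first "c",
# the greedy form runs to the last "c".
shortest = str.scan(/(?=(a.*?c))/).flatten  # => ["abc", "adc"]
longest  = str.scan(/(?=(a.*c))/).flatten   # => ["abcadc", "adc"]
combined = (shortest + longest).uniq        # => ["abc", "adc", "abcadc"]

# With three candidate endings, the middle one ("abcbc") is never reported.
other   = "abcbcbc"
partial = (other.scan(/(?=(a.*?c))/) + other.scan(/(?=(a.*c))/)).flatten.uniq
# => ["abc", "abcbcbc"]
```

This only ever finds the shortest and the longest match at each start, which happens to be enough for "abcadc".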

Answer 2

def matching_substrings(string, regex)
  # Try every start index and collect each substring that matches the anchored regex.
  string.size.times.each_with_object([]) do |start_index, matches|
    start_index.upto(string.size.pred) do |end_index|
      substring = string[start_index..end_index]
      matches.push(substring) if substring =~ /^#{regex}$/
    end
  end
end

matching_substrings('abcadc', /a.*c/) # => ["abc", "abcadc", "adc"]
matching_substrings('foobarfoo', /(\w+).*\1/) 
  # => ["foobarf",
  #     "foobarfo",
  #     "foobarfoo",
  #     "oo",
  #     "oobarfo",
  #     "oobarfoo",
  #     "obarfo",
  #     "obarfoo",
  #     "oo"]
matching_substrings('why is this downvoted?', /why.*/)
  # => ["why",
  #     "why ",
  #     "why i",
  #     "why is",
  #     "why is ",
  #     "why is t",
  #     "why is th",
  #     "why is thi",
  #     "why is this",
  #     "why is this ",
  #     "why is this d",
  #     "why is this do",
  #     "why is this dow",
  #     "why is this down",
  #     "why is this downv",
  #     "why is this downvo",
  #     "why is this downvot",
  #     "why is this downvote",
  #     "why is this downvoted",
  #     "why is this downvoted?"]
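One caveat about the anchoring used above: in Ruby, ^ and $ match at line boundaries, so on input containing newlines a substring can pass the check when only one of its lines matches. A minimal sketch of a stricter variant using \A and \z follows. The name matching_substrings_strict is hypothetical and the rest mirrors the brute-force idea above.

```ruby
# Hypothetical stricter variant: \A and \z anchor to the whole substring,
# whereas ^ and $ anchor to individual lines.
def matching_substrings_strict(string, regex)
  anchored = Regexp.new("\\A(?:#{regex.source})\\z", regex.options)
  (0...string.size).flat_map do |start_index|
    (start_index...string.size).map { |end_index| string[start_index..end_index] }
  end.select { |substring| anchored.match?(substring) }
end

text = "ab\nab"

# Line anchors accept the whole two-line string because its first line is "ab".
line_anchored_hit = /^#{/ab/}$/.match?(text)         # => true

strict = matching_substrings_strict(text, /ab/)      # => ["ab", "ab"]
```

For single-line input such as 'abcadc' both forms of anchoring give the same result.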

Answer 3

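You want all possible matches, including overlapping ones. As you've noted, the lookahead trick from "How to find overlapping matches with a regexp?" doesn't work for your case.

The only thing I can think of that works in the general case is to generate all possible substrings of the string and check each one against an anchored version of the regex. This is brute force, but it works.

Ruby: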

def all_matches(str, regex)
  (n = str.length).times.reduce([]) do |subs, i|
     subs += [*i..n].map { |j| str[i,j-i] }
  end.uniq.grep /^#{regex}$/
end

all_matches("abcadc", /a.*c/) 
#=> ["abc", "abcadc", "adc"]

JavaScript:

function allMatches(str, regex) {
  var i, j, len = str.length, subs={};
  var anchored = new RegExp('^' + regex.source + '$');
  for (i=0; i<len; ++i) {
    for (j=i; j<=len; ++j) {
       subs[str.slice(i,j)] = true;
    }
  }
  return Object.keys(subs).filter(function(s) { return s.match(anchored); });
}

Answer 4

In JS:

function doit(r, s) {
  var res = [], cur;
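  // Anchor the pattern and keep its flags, minus g and y: with those flags
  // test() tracks lastIndex, so consecutive calls could skip matches.
  r = new RegExp('^(?:' + r.source + ')$', r.toString().replace(/^[\s\S]*\/(\w*)$/, '$1').replace(/[gy]/g, ''));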
  for (var q = 0; q < s.length; ++q)
    for (var w = q; w <= s.length; ++w)
      if (r.test(cur = s.substring(q, w)))
        res.push(cur);
  return res;
}
document.body.innerHTML += "<pre>" + JSON.stringify(doit( /a.*c/g, 'abcadc' ), 0, 4) + "</pre>";

Answer 5

▶ str = "abcadc"
▶ from = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'a' }.compact
▶ to   = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'c' }.compact
▶ from.product(to).select { |f,t| f < t }.map { |f,t| str[f..t] }
#⇒ [
#  [0] "abc",
#  [1] "abcadc",
#  [2] "adc"
# ]

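I believe there is a fancy way to find all indices of a character in a string, but I was unable to find it :( Any ideas?

Splitting on the "unicode char boundary" makes this work with strings like 'ábĉ' or 'Üve Østergaard'.

For a more generic solution that accepts any "from" and "to" sequences, only a small modification is needed: find all indices of "from" and "to" in the string.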
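A minimal sketch of that generalisation, assuming "from" and "to" are plain strings rather than patterns. The helper names indices_of and spans_between are hypothetical. It walks the string with String#index and an offset, which is one straightforward way to collect every index.

```ruby
# Hypothetical helper: start index of every (possibly overlapping)
# occurrence of needle in str.
def indices_of(str, needle)
  result = []
  pos = str.index(needle)
  while pos
    result << pos
    pos = str.index(needle, pos + 1)
  end
  result
end

# Hypothetical helper: every substring that starts with from and ends with to,
# where to begins at or after the end of from.
def spans_between(str, from, to)
  indices_of(str, from).product(indices_of(str, to))
    .select { |s, e| s + from.length <= e }
    .map { |s, e| str[s...(e + to.length)] }
end

spans_between("abcadc", "a", "c") # => ["abc", "abcadc", "adc"]
```

Like the original, this only covers "from, anything, to" shapes, not arbitrary regular expressions.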

Answer 6

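Here's an approach similar to @ndn's and @Mark's that works with any string and regex. I've implemented it as a method of String because that's where I'd like to see it. Wouldn't it be a great complement to String#[] and String#scan?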

class String
  def all_matches(regex)
    return [] if empty?
    r = /^#{regex}$/
    1.upto(size).with_object([]) { |i,a|
      a.concat(each_char.each_cons(i).map(&:join).select { |s| s =~ r }) }
  end
end

'abcadc'.all_matches /a.*c/
  # => ["abc", "adc", "abcadc"]
'aaabaaa'.all_matches(/a.*a/)
  #=> ["aa", "aa", "aa", "aa", "aaa", "aba", "aaa", "aaba", "abaa", "aaaba",
  #    "aabaa", "abaaa", "aaabaa", "aabaaa", "aaabaaa"]
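The second result above contains "aa" four times because the same text matches at different positions. If you need to tell those apart, one option is to return the start offset with each match. A minimal sketch, written as a standalone method with a hypothetical name rather than as a String extension:

```ruby
# Hypothetical helper: returns [start_offset, substring] pairs so that
# identical substrings found at different positions stay distinguishable.
def all_matches_with_offsets(str, regex)
  anchored = Regexp.new("\\A(?:#{regex.source})\\z", regex.options)
  (0...str.length).flat_map do |i|
    (1..str.length - i).map { |len| [i, str[i, len]] }
  end.select { |_, s| anchored.match?(s) }
end

all_matches_with_offsets("aaba", /a.*a/)
# => [[0, "aa"], [0, "aaba"], [1, "aba"]]
```

Note that this version orders results by start position rather than by length.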

Answer 7

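The approach with the RegExp /(a.c)|(a.*c)/g is as follows: "a.c" matches an "a" character, followed by any single character, followed by "c"; "a.*c" matches "a", followed by any number of characters, followed by a "c" character. Note that the (a.*c) part of the RegExp could probably be improved. The if condition checks whether the last character of the input string is "c"; if true, the full input string is pushed to the res result array.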

var str = "abcadc"
, res = str.match(/(a.c)|(a.*c)/g); 
if (str[str.length - 1] === "c") res.push(str);

document.body.textContent = res.join(" ")

Answer 8

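This JavaScript approach has an advantage over Wiktor's answer: it iterates the substrings of the given string lazily, using a generator function, so with a for...of loop you can consume one match at a time even for very large input strings. Generating the entire array of matches at once could cause an out-of-memory exception, because the number of substrings of a string grows quadratically with its length: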

function * substrings (str) {
  for (let length = 1; length <= str.length; length++) {
    for (let i = 0; i <= str.length - length; i++) {
      yield str.slice(i, i + length);
    }
  }
}

function * matchSubstrings (str, re) {
  const subre = new RegExp(`^${re.source}$`, re.flags);
  
  for (const substr of substrings(str)) {
    if (subre.test(substr)) yield substr;
  }
}

for (const match of matchSubstrings('abcabc', /a.*c/)) {
  console.log(match);
}
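For the Ruby side of the question, the same lazy idea can be sketched with Enumerator and Enumerator::Lazy. The method names substrings and lazy_matches are hypothetical, chosen to mirror the generator functions above. Treat this as an illustration under those assumptions, not a tuned implementation.

```ruby
# Hypothetical helper: yields substrings one at a time, shortest first,
# without building the whole list in memory.
def substrings(str)
  Enumerator.new do |yielder|
    1.upto(str.length) do |len|
      0.upto(str.length - len) { |i| yielder << str[i, len] }
    end
  end
end

# Hypothetical helper: lazily filters the substrings through an anchored regex.
def lazy_matches(str, regex)
  anchored = Regexp.new("\\A(?:#{regex.source})\\z", regex.options)
  substrings(str).lazy.select { |s| anchored.match?(s) }
end

lazy_matches("abcabc", /a.*c/).first(2) # => ["abc", "abc"]
lazy_matches("abcabc", /a.*c/).to_a     # => ["abc", "abc", "abcabc"]
```

Calling first(2) stops the enumeration as soon as two matches have been found, so the remaining substrings are never generated.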

How to get possibly overlapping matches in a string
