Find all matching substrings, not only the "most extended" one

Time: 2012-06-27 14:21:50

Tags: java regex substring

The code

String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c *)c");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

prints

a a a b c c

This is correct.

But logically, the substrings

a a a b c
a a b c c
a a b c
a b c c
a b c

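also match the regular expression.

So how can I make the code find those substrings too, i.e. not only the most extended one, but also its children?

5 answers:

Answer 0: (score: 7)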

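You can use the reluctant quantifiers such as *? and +?. These match as little as possible, in contrast to the standard greedy * and +, which match as much as possible. Still, this only lets you find particular "sub-matches", not all of them. Some more control can be achieved using lookahead and non-capturing groups, which are also described in the documentation. But to really find all sub-matches, you would probably have to do the work yourself, i.e. build the automaton that corresponds to the regex and navigate it with custom code.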
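One way to read the lookahead suggestion, sketched here under my own assumptions: wrapping the whole pattern in a lookahead with a capturing group makes every match zero-width, so find() retries at each following position and reports the greedy match from every start. The class and method names are hypothetical, not from the answer, and this still yields only one match per start position rather than every sub-match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not part of the original answer.
class LookaheadMatches {

    // Returns the greedy match that begins at each start position of the input.
    static List<String> matchesFromEveryStart(String regex, String input) {
        // The lookahead consumes nothing, so the matcher moves forward one
        // character at a time; group 1 holds what the inner pattern matched.
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String match : matchesFromEveryStart("(a )+(b )+(c *)c", "y z a a a b c c z")) {
            System.out.println(match);
        }
    }
}
```

Note that wrapping the pattern shifts its own group numbers by one, which matters if the caller reads inner groups.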

Answer 1: (score: 2)

You need a lazy quantifier.

Try the following:

Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");

Note that I grouped the "c " once more, since I think that is what you want. Otherwise you would match arbitrarily many spaces, but not "c".
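To see what the lazy quantifier changes, here is a small comparison sketch. The greedy variant of the pattern and the helper name firstMatch are my own additions for illustration; only the lazy pattern comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class; only the lazy pattern is from the answer.
class GreedyVsLazy {

    // Returns the first match of the regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "y z a a a b c c z";
        // Greedy: (c )* takes as many "c " repetitions as still allow a match.
        System.out.println(firstMatch("(a )+(b )+((c )*)c", s));
        // Lazy: (c )*? takes as few repetitions as possible.
        System.out.println(firstMatch("(a )+(b )+((c )*?)c", s));
    }
}
```

The greedy form returns "a a a b c c" and the lazy form returns "a a a b c". Either way find() reports a single match from that start position, so this alone does not enumerate every sub-match.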

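Answer 2: (score: 0)

The only way I can think of is to generate a list of all possible substrings of the original string, match the regex against each of them, and keep the ones that match.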
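A minimal Java sketch of that brute-force idea, assuming a candidate counts only if the regex matches it in full (Matcher.matches()). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical brute-force helper; names are illustrative.
class AllSubstringMatches {

    // Tests every non-empty substring and keeps those the regex matches entirely.
    static List<String> findAll(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                String candidate = input.substring(start, end);
                // matches() requires the whole candidate to match, unlike find().
                if (p.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String match : findAll("(a )+(b )+(c *)c", "y z a a a b c c z")) {
            System.out.println(match);
        }
    }
}
```

The number of candidates grows quadratically with the input length, so this is only practical for short strings.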

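Answer 3: (score: 0)

I don't know of any regex engine that can give back all valid matches.

But we can apply a bit of logic to generate all candidate strings and present them to the regex.

The candidates are constructed by enumerating all possible substrings of the given input.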

var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");

var length = str.Length;

for (int start = 1; start <= length;start++){

    for (int groupLength = 1;  start + groupLength - 1 <= length ;groupLength++){

        var candidate = str.Substring(start-1,groupLength); //.Dump();

        //("\"" + candidate + "\"").Dump();

        var match = regex.Match(candidate);

        if (match.Value == candidate )
        {
            candidate.Dump();
        }

    }
}

This gives

a a a b c c 
a a b c c 
a b c c 

This seems to be the correct answer, but it contradicts your results:

a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match

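For example, the regular expression you gave

(a )+(b )+(c *)c

does not match the first entry in your list, because (c *) must consume one c and the trailing c requires a second one:

a a a b c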
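The claim above is easy to check with Pattern.matches. In this sketch the second pattern, (a )+(b )+(c )*c, is only my guess at what the question may have intended; it is not stated anywhere in the thread. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical check; the alternative pattern is an assumption, not from the thread.
class SingleCCheck {

    // True only if the regex matches the entire candidate string.
    static boolean fullyMatches(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        String asGiven = "(a )+(b )+(c *)c";  // needs at least two c characters
        String guessed = "(a )+(b )+(c )*c";  // assumed intent: one c is enough
        System.out.println(fullyMatches(asGiven, "a a a b c"));    // false
        System.out.println(fullyMatches(asGiven, "a a a b c c"));  // true
        System.out.println(fullyMatches(guessed, "a a a b c"));    // true
    }
}
```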

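If you consider the start position unimportant, the above logic can produce identical matches more than once. For example, if you repeat the given input once more:

"y z a a a b c c z y z a a a b c c z"

it gives: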

a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c

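If you consider the position unimportant, you should do a distinct on this result.

The trivial case where the input is the empty string should also be added, if that is considered a potential match.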
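In Java, the distinct step could be sketched by collecting the matches into a LinkedHashSet, which drops duplicates while keeping first-seen order. This is an assumption about how one might port the idea, with hypothetical names. Like the loop above, it never tests the empty string as a candidate.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical Java port of the distinct step; names are illustrative.
class DistinctMatches {

    // Collects every fully matching non-empty substring once, in first-seen order.
    static Set<String> findDistinct(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        Set<String> result = new LinkedHashSet<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                String candidate = input.substring(start, end);
                if (p.matcher(candidate).matches()) {
                    result.add(candidate); // a repeated match is ignored by the set
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String repeated = "y z a a a b c c z y z a a a b c c z";
        for (String match : findDistinct("(a )+(b )+(c *)c", repeated)) {
            System.out.println(match);
        }
    }
}
```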

FYI, these are all the candidates that the regex examines

"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"

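It is also good to know how the 2 main types of regex engines (NFA and DFA) do their work.

From http://msdn.microsoft.com/en-us/library/e347654k.aspx

.NET (and I think Java too) is an NFA regex engine (as opposed to a DFA), and as it processes a particular language element, the engine uses greedy matching; that is, it matches as much of the input string as it possibly can. But it also saves its state after successfully matching a subexpression. If a match eventually fails, the engine can return to a saved state so that it can try additional matches. This process of abandoning a successful subexpression match so that later language elements in the regular expression can also match is known as backtracking.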
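Backtracking can be observed in Java through capture groups. In this sketch (class and method names are hypothetical, and the pattern is the grouped variant from Answer 1 made greedy), (c )* first consumes both "c " repetitions, then gives one back so that the final c can match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; names are illustrative.
class BacktrackingDemo {

    // Returns {whole match, text captured by group 3}, or null if nothing matches.
    static String[] matchAndCGroup(String input) {
        // Group 1 = (a ), group 2 = (b ), group 3 = ((c )*), group 4 = (c ).
        Matcher m = Pattern.compile("(a )+(b )+((c )*)c").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(3) };
    }

    public static void main(String[] args) {
        String[] r = matchAndCGroup("y z a a a b c c z");
        System.out.println("match:   \"" + r[0] + "\"");
        // Group 3 ends up holding one repetition, not two: the engine
        // abandoned the second "c " so the trailing c could match.
        System.out.println("group 3: \"" + r[1] + "\"");
    }
}
```

The whole match is "a a a b c c" and group 3 is "c ". The engine stops at this first successful path, which is why the shorter alternatives are never reported.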

Answer 4: (score: -1)

Given these very specific constraints (i.e. this is not a general-case solution), this will work:

import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {

    public static void main(String[] args) {

        String s = "y z a a a b c c z";

        Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
        Set<String> set = recurse(s, p, 0);
        // Print the collected matches; without this the result is discarded.
        System.out.println(set);
    }

    public static Set<String> recurse(String s, Pattern p, int depth) {
        int temp = depth;
        while(temp>0) {
            System.out.print("  ");
            temp--;
        }
        System.out.println("-> " +s);

        Matcher matcher = p.matcher(s);
        Set<String> set = new TreeSet<String>();

        if(matcher.find()) {
            String found = matcher.group().trim();
            set.add(found);
            set.addAll(recurse(found.substring(1), p, depth+1));
            set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
        }

        while(depth>0) {
            System.out.print("  ");
            depth--;
        }
        System.out.println("<- " +s);
        return set;
    }
}

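I am reasonably sure you can adapt it to other cases, but recursing into the matched string means that overlapping matches (like the one @ahenderson pointed out) won't work.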