Regex look-behind without obvious maximum length in Java

Date: 2009-10-08 10:22:56

Tags: java regex

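I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious length, which is why the star and plus quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions: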

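"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"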

- http://www.regular-expressions.info/lookaround.html

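Curly braces work as long as the total length of the range of characters inside the look-behind is less than or equal to Integer.MAX_VALUE. So these regexes are valid: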

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these are not:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"

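To check these cases on a particular JVM, something like the following sketch could be used. The class LookBehindCheck and its compiles method are made-up names for illustration and are not part of the test harness further down. The outcome for the very large bounds may differ between Java versions, so no expected results are listed.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for illustration only.
class LookBehindCheck {

    // Returns true if the regex compiles, false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] regexes = {
            "(?<=a{0,3})B",                                 // small explicit bound
            "(?<=a{0," + Integer.MAX_VALUE + "})B",         // bound at the int limit
            "(?<=Ca{0," + (Integer.MAX_VALUE - 1) + "})B",  // prefix plus bound equals MAX_VALUE
            "(?<=Ca{0," + Integer.MAX_VALUE + "})B"         // prefix plus bound exceeds MAX_VALUE
        };
        for (String regex : regexes) {
            // Results depend on the Java version in use, so they are only printed.
            System.out.println(regex + " -> compiles = " + compiles(regex));
        }
    }
}
```

Only compilation is probed here; nothing is matched against input, so the check stays cheap even for the largest bounds.
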
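However, I don't understand the following:

When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).

But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).

Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).

Here is the test harness: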

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

Output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

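My question may be obvious, but I'll ask it anyway: can anyone explain to me why Tests 1 and 2 compile and run while Tests 3 and 4 throw an exception? I would have expected all of them to fail, not half of them to work and half of them to fail.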

Thanks.

PS. I'm using Java version 1.6.0_14.

2 answers:

Answer 0 (score: 17)

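A glance at the source code of Pattern.java reveals that '*' and '+' are implemented as instances of Curly (the object created for the curly-brace operators). So

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behaviour for curly braces and stars.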
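
The equivalence can also be observed from the outside with ordinary matching. The sketch below is only an illustration: the class name StarVersusCurly is made up, and it compares matching results rather than inspecting the internals of Pattern. Note that regex syntax takes the bound in decimal, so 0x7FFFFFFF has to be written as 2147483647 (Integer.MAX_VALUE).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only compares observable matching results.
class StarVersusCurly {

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE; // 2147483647, i.e. 0x7FFFFFFF
        String[][] pairs = {
            {"a*B", "a{0," + max + "}B"},
            {"a+B", "a{1," + max + "}B"}
        };
        String[] inputs = {"B", "aB", "aaaaB"};
        for (String[] pair : pairs) {
            for (String input : inputs) {
                // Print both results side by side for each input.
                System.out.println(pair[0] + " vs " + pair[1] + " on " + input + ": "
                        + matches(pair[0], input) + " / " + matches(pair[1], input));
            }
        }
    }
}
```

Each pair should agree on every input, with the a+ variants rejecting the bare "B" because they require at least one a.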

Answer 1 (score: 13)

This is a bug: http://bugs.sun.com/view_bug.do?bug_id=6695369

Pattern.compile() is always supposed to throw an exception if it cannot determine the maximum possible length of a look-behind match.
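
One way to stay clear of the ambiguity is to give the look-behind an explicit upper bound, which the regular-expressions.info passage quoted in the question describes as supported. A minimal sketch, assuming a bound of 10 is acceptable for the input at hand; the class name, the constant and the bound are illustrative choices and do not come from the bug report.

```java
// Hypothetical example class; the names and the bound are assumptions.
class BoundedLookBehind {

    // Illustrative upper bound on the number of 'a' characters between C and B.
    static final int MAX_AS = 10;

    // Look-behind with an explicit, finite maximum length (1 + MAX_AS characters).
    static final String REGEX = "(?<=Ca{0," + MAX_AS + "})B";

    // Replaces every B that is preceded by C and at most MAX_AS 'a' characters.
    static String replaceB(String input) {
        return input.replaceAll(REGEX, "FOO");
    }

    public static void main(String[] args) {
        System.out.println(replaceB("CaaaB"));  // C within reach of the look-behind
        System.out.println(replaceB("xaaB"));   // no C before the run of a's
    }
}
```

The trade-off is that a B preceded by more than MAX_AS a's is no longer matched, so the bound has to be chosen with the expected input in mind.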