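I've always been under the impression that look-behind assertions in Java's regex API (and in many other languages, for that matter) must have an obvious maximum length. So the STAR and PLUS quantifiers are not allowed inside look-behinds.
The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:
"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"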
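As I read that passage, a bounded repetition such as a{1,3} inside a look-behind should behave like the alternation a|aa|aaa. Here is a small sketch of that equivalence; the class name and the sample inputs are my own choices for illustration, not taken from the site:

```java
import java.util.regex.Pattern;

class FiniteRepetitionDemo {

    // Bounded repetition inside the look-behind.
    static final Pattern BRACES = Pattern.compile("(?<=a{1,3})B");

    // The same look-behind spelled out as an alternation of fixed-length strings.
    static final Pattern ALTERNATION = Pattern.compile("(?<=a|aa|aaa)B");

    public static void main(String[] args) {
        // "CB" has no 'a' directly before the B, so neither pattern should find a match there.
        String[] inputs = {"CaaB", "CB", "aaaaB"};
        for (String input : inputs) {
            System.out.println(input
                    + ": braces=" + BRACES.matcher(input).find()
                    + ", alternation=" + ALTERNATION.matcher(input).find());
        }
    }
}
```

Both patterns agree on every one of these inputs, which is what the quoted explanation predicts.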
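Using the curly braces works as long as the total length of the range of characters inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid: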
"(?<=a{0," +(Integer.MAX_VALUE) + "})B"
"(?<=Ca{0," +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"
But these aren't:
"(?<=Ca{0," +(Integer.MAX_VALUE) +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"
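However, I don't understand the following:
When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).
But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).
Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).
Here's the test harness: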
public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit : Valid -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println(" "+testFind(regex, input));
            System.out.println(" "+testReplaceAll(regex, input));
            System.out.println(" "+testSplit(regex, input));
            System.out.println();
        }
    }
}
The output:
********************** Test 1 **********************
testFind : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 2 **********************
testFind : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 3 **********************
testFind : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
testSplit : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
********************** Test 4 **********************
testFind : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
testSplit : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
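My question may seem obvious, but I'll still ask it: can anyone explain to me why Tests 1 and 2 work while Tests 3 and 4 fail? I would have expected them all to fail, not half of them to work and half of them to fail.
Thanks.
PS. I'm using: Java version 1.6.0_14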
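Answer 0: (score: 17)
Glancing at the source code for Pattern.java reveals that '*' and '+' are implemented as instances of Curly (which is the object created for the curly-brace operators). So,
a*
is implemented as
a{0,0x7FFFFFFF}
and
a+
is implemented as
a{1,0x7FFFFFFF}
which is why you see exactly the same behaviour for curlies and stars.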
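One way to picture why the leading C matters is the following sketch. It assumes that the maximum look-behind length is accumulated in a signed int and that a wrapped-around sum is treated as "no obvious maximum length"; that is a model chosen because it reproduces the valid and invalid patterns listed in the question, not a quotation of Pattern.java. The class and method names are hypothetical, and the model only covers a fixed prefix followed by a repeated single character.

```java
class LookBehindLengthModel {

    // Hypothetical model: the look-behind has an "obvious maximum length"
    // only if prefixLength + maxRepetitions still fits in a signed int.
    // Both arguments are assumed to be non-negative.
    static boolean hasObviousMaximum(int prefixLength, int maxRepetitions) {
        int total = prefixLength + maxRepetitions; // may wrap around to a negative value
        return total >= prefixLength;              // a wrapped sum is smaller than the prefix
    }

    public static void main(String[] args) {
        int max = 0x7FFFFFFF; // the upper bound used for '*' and '+', equal to Integer.MAX_VALUE

        System.out.println("(?<=a*)B            -> " + hasObviousMaximum(0, max));     // true
        System.out.println("(?<=Ca{0,MAX-1})B   -> " + hasObviousMaximum(1, max - 1)); // true
        System.out.println("(?<=CCa{0,MAX-2})B  -> " + hasObviousMaximum(2, max - 2)); // true
        System.out.println("(?<=Ca*)B           -> " + hasObviousMaximum(1, max));     // false
        System.out.println("(?<=CCa{0,MAX-1})B  -> " + hasObviousMaximum(2, max - 1)); // false
    }
}
```

Under this model, a* on its own sums to exactly 0x7FFFFFFF and passes, while any prefix in front of it pushes the sum past the int range.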
Answer 1: (score: 13)
It's a bug: http://bugs.sun.com/view_bug.do?bug_id=6695369
Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a look-behind match.
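In the meantime, a way to avoid relying on the buggy behaviour is to give the look-behind an explicit upper bound. The sketch below assumes that 100 is a safe limit for the run of a's in your input; that limit, the class name and the method name are illustrative choices, not something taken from the bug report.

```java
class BoundedLookBehind {

    // Replaces every B that is preceded by a C followed by at most 100 a's.
    // The bound of 100 is an arbitrary illustrative limit.
    static String replaceB(String input) {
        return input.replaceAll("(?<=Ca{0,100})B", "FOO");
    }

    public static void main(String[] args) {
        // Same input as in the question; prints it with the B replaced by FOO.
        System.out.println(replaceB("CaaaaaaaaaaaaaaaBaaaa"));
    }
}
```

Because the quantifier has a small, explicit maximum, the compiler can always work out the longest possible look-behind match, so this form does not depend on how * and + happen to be handled.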