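Regex expression fall-through in Java

Time: 2014-05-30 07:32:42

Tags: java regex expression

I'm having trouble developing a regex pattern in Java to split a string of university timetable classes.

A sample string looks like this: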

"CIVL4401_SEM-1:Laboratory_Lab1: 05:11:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] (cont) CIVL4401_SEM-1:Laboratory_Lab2: 07:19:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] (cont) "

(all on one line)

Using the regex pattern:

final String classregex = "(?<=\\(cont\\)\\s|\\[Pref \\d{1,2}\\]\\s)";

it should split into two class entries:

"CIVL4401_SEM-1:Laboratory_Lab1: 05:11:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] (cont) "
"CIVL4401_SEM-1:Laboratory_Lab2: 07:19:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] (cont) "

The zero-length lookbehind is intentional; I want to keep all of the data.

Instead, I get:

"CIVL4401_SEM-1:Laboratory_Lab1: 05:11:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] "
"(cont) "
"CIVL4401_SEM-1:Laboratory_Lab2: 07:19:Engineering - Civil & Mechanical: Soils Lab (G99): [Pref 1] "
"(cont) "

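I'm fairly sure I understand why this happens: it matches "[Pref \d]" first and splits there, then carries on through the rest and finds "(cont)" immediately afterwards, and so on.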
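A minimal sketch that reproduces the behaviour described above. The class name `SplitRepro` and the method `splitEntries` are hypothetical names introduced for illustration; only the pattern and the sample string come from the question. The lookbehind is tested at every position in the input, so the position after "[Pref 1] " and the position after "(cont) " both qualify as split points, whichever alternative is listed first.

```java
public class SplitRepro {
    // Pattern from the question: a zero-width split point after "(cont) "
    // or after "[Pref N] " (N being one or two digits).
    static final String CLASS_REGEX = "(?<=\\(cont\\)\\s|\\[Pref \\d{1,2}\\]\\s)";

    // Hypothetical helper: splits a timetable string with the pattern above.
    static String[] splitEntries(String timetable) {
        return timetable.split(CLASS_REGEX);
    }

    public static void main(String[] args) {
        String input = "CIVL4401_SEM-1:Laboratory_Lab1: 05:11:Engineering - Civil & Mechanical: "
                + "Soils Lab (G99): [Pref 1] (cont) "
                + "CIVL4401_SEM-1:Laboratory_Lab2: 07:19:Engineering - Civil & Mechanical: "
                + "Soils Lab (G99): [Pref 1] (cont) ";
        // Each "(cont) " ends up as its own element, as reported in the question.
        for (String part : splitEntries(input)) {
            System.out.println("\"" + part + "\"");
        }
    }
}
```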

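Note that there are also timetable classes without "(cont)" in them, which is why "[Pref \d]" is part of the regex.

Is there a way to control the order in which the Java regex engine works? I'd like it to try matching "(cont)" first and only then try the "[Pref \d]" part. My guess is that it needs some complicated combination of lookahead and lookbehind, which I don't know how to write.

If this can't be done, I'll write a fix-up function to handle it. Thanks.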

2 answers:

Answer 0: (score: 1)

How about this:

(?<=\(cont\)\s|\[Pref\s\d\]\s(?!\(cont\)))

This additionally checks that [Pref \d] is not followed by (cont).

In the Java world:

(?<=\\(cont\\)\\s|\\[Pref\\s\\d\\]\\s(?!\\(cont\\)))

But I was surprised to find that even this works:

(?<=\\(cont\\)\\s|\\[Pref\\s\\d{1,2}\\]\\s(?!\\(cont\\)))
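As a rough sketch, the pattern above could be applied with `String.split` like this. The class name `SplitFixed` and the method `splitEntries` are hypothetical names used only for illustration; the regex is the one from this answer. The second input in `main` is an invented example of a timetable whose first entry has no "(cont)".

```java
public class SplitFixed {
    // Split after "(cont) ", or after "[Pref N] " only when "(cont)" does not follow.
    static final String CLASS_REGEX =
            "(?<=\\(cont\\)\\s|\\[Pref\\s\\d{1,2}\\]\\s(?!\\(cont\\)))";

    // Hypothetical helper: splits a timetable string into class entries.
    static String[] splitEntries(String timetable) {
        return timetable.split(CLASS_REGEX);
    }

    public static void main(String[] args) {
        String input = "CIVL4401_SEM-1:Laboratory_Lab1: 05:11:Engineering - Civil & Mechanical: "
                + "Soils Lab (G99): [Pref 1] (cont) "
                + "CIVL4401_SEM-1:Laboratory_Lab2: 07:19:Engineering - Civil & Mechanical: "
                + "Soils Lab (G99): [Pref 1] (cont) ";
        for (String entry : splitEntries(input)) {
            System.out.println("\"" + entry + "\"");
        }
        // Invented input: the first entry has no "(cont)", the second one does.
        for (String entry : splitEntries("A: [Pref 1] B: [Pref 2] (cont) ")) {
            System.out.println("\"" + entry + "\"");
        }
    }
}
```

Because the split points are zero-width, no characters are dropped; each entry keeps its trailing "[Pref N] " and, where present, "(cont) ".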

As the OP mentioned in the comments, Java appears to support finite-range quantifiers inside lookbehind. Here is an excerpt from regular-expressions.info:

Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java determines the minimum and maximum possible lengths of the lookbehind. The lookbehind in the regex (?<!ab{2,4}c{3,5}d)test has 6 possible lengths. It can be between 7 to 11 characters long. When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of characters (7 in this example) in the string and then evaluates the regex inside the lookbehind as usual, from left to right. If it fails, Java steps back one more character and tries again. If the lookbehind continues to fail, Java continues to step back until the lookbehind either matches or it has stepped back the maximum number of characters (11 in this example). This repeated stepping back through the subject string kills performance when the number of possible lengths of the lookbehind grows. Keep this in mind. Don't choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers inside lookbehind. Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail when it should succeed in some situations. These bugs were fixed in Java 6.
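The example regex from the excerpt can be tried directly. This is only a small sketch; the class name `LookbehindLengths` and the method `findsTest` are hypothetical, and the input strings are invented to sit at the shortest and longest lookbehind lengths (7 and 11 characters) and just outside the allowed repetition counts.

```java
import java.util.regex.Pattern;

public class LookbehindLengths {
    // "test" that is NOT preceded by: a, 2-4 b's, 3-5 c's, d.
    static final Pattern P = Pattern.compile("(?<!ab{2,4}c{3,5}d)test");

    // Hypothetical helper: true if the pattern finds "test" somewhere in s.
    static boolean findsTest(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        // Shortest prefix the lookbehind can match (7 characters).
        System.out.println(findsTest("abbcccdtest"));
        // Longest prefix the lookbehind can match (11 characters).
        System.out.println(findsTest("abbbbcccccdtest"));
        // Only one 'b', so the lookbehind cannot match and "test" is found.
        System.out.println(findsTest("abcccdtest"));
    }
}
```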

Answer 1: (score: 0)

Have you tried using the '?' conditional, i.e. http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

I believe adding (\(cont\))? would work.
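One possible reading of this suggestion, sketched under assumptions: simply appending an optional group to the split lookbehind would still allow a split right after "[Pref N] ", because the optional group may match nothing. An optional "(cont) " does fit naturally if the entries are matched with `Matcher.find` instead of being split. The class name `EntryMatcher`, the method `entries`, and the exact pattern below are hypothetical and not taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntryMatcher {
    // One entry: as little text as possible up to "[Pref N] ",
    // followed by an optional "(cont) " that is consumed greedily when present.
    static final Pattern ENTRY =
            Pattern.compile(".+?\\[Pref \\d{1,2}\\]\\s(?:\\(cont\\)\\s)?");

    // Hypothetical helper: collects every entry found in the timetable string.
    static List<String> entries(String timetable) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(timetable);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented input: the first entry has no "(cont)", the second one does.
        for (String entry : entries("A: [Pref 1] B: [Pref 2] (cont) ")) {
            System.out.println("\"" + entry + "\"");
        }
    }
}
```

Unlike the split-based approach, trailing text that does not end in a "[Pref N] " marker is not returned by this loop, so it only keeps all of the data when every entry ends with that marker.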