Question

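I need to identify the (possibly nested) capture groups within a regular expression and build a tree from them. The specific target is Java 1.6, and ideally I'd like Java code. A simple example:

"(a(b|c)d(e(f*g))h)"

would be parsed to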

"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
     ... "f*g"

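Ideally the solution should take into account count expressions, quantifiers, etc., as well as levels of escaping. However, if that is not easy to find, a simpler approach would be sufficient, since we can restrict the syntax that is used.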
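For the "simpler approach" case, here is a minimal sketch of a single-pass scanner that keeps a stack of open groups. The class names `GroupNode` and `GroupTreeSketch` are made up for illustration. It assumes a restricted syntax: backslash escapes are skipped, and a group starting with `(?` is flagged as non-capturing, but character classes such as `[(]` and `\Q...\E` quoting are not handled, so brackets inside those would be miscounted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One node per parenthesised group; the root stands for the whole regex. */
final class GroupNode {
    final int start;          // index of the first character inside the group
    final boolean capturing;  // false for groups that start with "(?"
    String text;              // group content without the enclosing brackets
    final List<GroupNode> children = new ArrayList<GroupNode>();

    GroupNode(int start, boolean capturing) {
        this.start = start;
        this.capturing = capturing;
    }
}

final class GroupTreeSketch {
    /** Builds the tree; throws IllegalArgumentException on unbalanced brackets. */
    static GroupNode parse(String regex) {
        GroupNode root = new GroupNode(0, false);
        Deque<GroupNode> open = new ArrayDeque<GroupNode>();
        open.push(root);
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, e.g. \( or \\
            } else if (c == '(') {
                // Assumption: in Java 1.6 every "(?" construct is non-capturing.
                GroupNode child = new GroupNode(i + 1, !regex.startsWith("?", i + 1));
                open.peek().children.add(child);
                open.push(child);
            } else if (c == ')') {
                if (open.size() == 1) {
                    throw new IllegalArgumentException("unmatched ')' at index " + i);
                }
                GroupNode done = open.pop();
                done.text = regex.substring(done.start, i);
            }
        }
        if (open.size() != 1) {
            throw new IllegalArgumentException("unclosed '('");
        }
        root.text = regex;
        return root;
    }

    /** Appends one line per node, indenting children with "... ". */
    static void print(GroupNode node, String indent, StringBuilder out) {
        out.append(indent).append('"').append(node.text).append("\"\n");
        for (GroupNode child : node.children) {
            print(child, indent + "... ", out);
        }
    }
}
```

Calling `GroupTreeSketch.parse("(a(b|c)d(e(f*g))h)")` returns a root whose single child holds the outer group. Because children are added in the order their opening brackets appear, a pre-order walk over the capturing nodes follows the usual group numbering.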

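EDIT: To clarify, I want to parse the regular expression string itself. To do that, I need to know the BNF (or an equivalent) for Java 1.6 regexes. I am hoping someone has already done this.

A by-product of the result would be that the process tests the validity of the regex.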
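If validity is the only concern, the JDK's own compiler can be asked directly, independently of any hand-written parser. The sketch below uses only `java.util.regex`; the class and method names (`RegexValidity`, `syntaxError`, `groupCount`) are made up for illustration. It does not produce a tree, but the group count it reports can serve as a cross-check for one.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexValidity {
    /** Returns null if the JDK accepts the regex, otherwise a description of the problem. */
    static String syntaxError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " near index " + e.getIndex();
        }
    }

    /** Number of capturing groups as counted by the JDK itself. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }
}
```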

Answer 1

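Consider stepping up to an actual parser/lexer: http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started

It looks complicated, but if your language is fairly simple, it is fairly straightforward. And if it isn't, doing this in regexes would probably make your life hell :)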
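To give a feel for what "an actual parser" means here, below is a minimal hand-written recursive-descent recognizer. This is not ANTLR output, and the grammar in the comment is a deliberately tiny, hypothetical subset rather than the real Java 1.6 regex syntax: no character classes, no `{n,m}` counts, no reluctant or possessive quantifiers. The class name `MiniRegexParser` is made up for illustration.

```java
/**
 * Hypothetical restricted grammar (NOT the full Java 1.6 syntax):
 *   regex  ::= branch ('|' branch)*
 *   branch ::= piece*
 *   piece  ::= atom ('*' | '+' | '?')?
 *   atom   ::= '(' regex ')' | '\' anyChar | any char other than ( ) | * + ? \
 */
final class MiniRegexParser {
    private final String src;
    private int pos = 0;
    private int groups = 0;

    private MiniRegexParser(String src) { this.src = src; }

    /** Returns the number of groups; throws IllegalArgumentException if src is outside the grammar. */
    static int parse(String src) {
        MiniRegexParser p = new MiniRegexParser(src);
        p.regex();
        if (p.pos != src.length()) {
            throw p.error("unexpected '" + src.charAt(p.pos) + "'");
        }
        return p.groups;
    }

    private void regex() {
        branch();
        while (peek() == '|') {
            pos++;
            branch();
        }
    }

    private void branch() {
        while (pos < src.length() && peek() != '|' && peek() != ')') {
            piece();
        }
    }

    private void piece() {
        atom();
        char c = peek();
        if (c == '*' || c == '+' || c == '?') {
            pos++; // a single optional quantifier
        }
    }

    private void atom() {
        char c = peek();
        if (c == '(') {
            pos++;
            groups++;
            regex();
            if (peek() != ')') throw error("expected ')'");
            pos++;
        } else if (c == '\\') {
            if (pos + 1 >= src.length()) throw error("dangling backslash");
            pos += 2; // the backslash plus the escaped character
        } else if (c == '*' || c == '+' || c == '?') {
            throw error("quantifier with nothing to repeat");
        } else {
            pos++; // ordinary literal character
        }
    }

    /** Current character, or '\0' at end of input. */
    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at index " + pos);
    }
}
```

Each grammar rule maps to one method, so extending the grammar (for example with `{n,m}`) means adding a case to `piece()` rather than reworking a chain of string replacements.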

Answer 2

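I came up with a partial solution that uses an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. The escaped characters (\, ( and )) are first de-escaped (here I use the placeholders BS, LB and RB), then the remaining brackets are translated to XML tags, then the XML is parsed and the characters are re-escaped. What is still needed is a BNF for Java 1.6 regexes, e.g. for quantifiers, ?:, {d,d} and so on.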

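import java.io.StringReader;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Text;

public static Element parseRegex(String regex) throws Exception {
    // Replace escaped characters with placeholders so that only real brackets remain.
    regex = regex.replaceAll("\\\\", "BS");
    regex = regex.replaceAll("BS\\(", "LB");
    regex = regex.replaceAll("BS\\)", "RB");
    // Turn the remaining brackets into XML tags and let XOM build the tree.
    regex = regex.replaceAll("\\(", "<bracket>");
    regex = regex.replaceAll("\\)", "</bracket>");
    Element regexX = new Builder().build(new StringReader(
         "<regex>"+regex+"</regex>")).getRootElement();
    extractCaptureGroupContent(regexX);
    return regexX;
}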

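// Recursively restores the placeholders in every text node and stores the
// reassembled content of each element in its "capture" attribute.
private static String extractCaptureGroupContent(Element regexX) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < regexX.getChildCount(); i++) {
        Node childNode = regexX.getChild(i);
        if (childNode instanceof Text) {
            Text t = (Text)childNode;
            String s = t.getValue();
            // In a replacement string "\\(" yields a plain "(", so LB and RB
            // come back as bare brackets; only BS is restored to a backslash.
            s = s.replaceAll("BS", "\\\\").replaceAll("LB",
                        "\\(").replaceAll("RB", "\\)");
            t.setValue(s);
            sb.append(s);
        } else {
            // A child element is a nested bracket: recurse and re-wrap it.
            sb.append("("+extractCaptureGroupContent((Element)childNode)+")");
        }
    }
    String capture = sb.toString();
    regexX.addAttribute(new Attribute("capture", capture));
    return capture;
}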

Example:

@Test
public void testParseRegex2() throws Exception {
    String regex = "(.*(\\(b\\))c(d(e)))";
    Element regexElement = ParserUtil.parseRegex(regex);
    CMLUtil.debug(regexElement, "x");
}

gives:

<regex capture="(.*((b))c(d(e)))">
  <bracket capture=".*((b))c(d(e))">.*
    <bracket capture="(b)">(b)</bracket>c
    <bracket capture="d(e)">d
      <bracket capture="e">e</bracket>
    </bracket>
  </bracket>
</regex>

Code to parse capture groups in regular expressions into a tree
