Recursive group capture regex with a backreference in Java

Asked: 2015-08-17 02:33:07

Tags: java regex backreference capture-group recursive-regex

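I am trying to recursively capture multiple groups in a string while also using a backreference to a group inside the regex. Even though I use Pattern and Matcher with a "while(matcher.find())" loop, it still captures only the last instance instead of all of them. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture: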

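  1. Any text outside the tags, so that I can format it as "normal" text. I do this by capturing any text before a tag in one group and the tag itself in another group; as I iterate over the matches I remove everything that was captured from the original string, and if any text is left at the end I format it as "normal" text.
  2. The "name" of the tag, so that I know how to format the text inside the tag.
  3. The text content of the tag, which will be formatted according to the tag name and its associated rules.

Here is my sample code: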

            String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
            String remainingText = currentText;
    
            //first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
            if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
            {                
                //an opening or closing tag has been found, so let us start our pattern captures
                //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
                Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
                Matcher matcher1 = pattern1.matcher(currentText);                
                int iteration = 0;
                while(matcher1.find()){
                    System.out.print("Iteration ");
                    System.out.println(++iteration);
                    System.out.println("group1:"+matcher1.group(1));
                    System.out.println("group2:"+matcher1.group(2));
                    System.out.println("group3:"+matcher1.group(3));
                    System.out.println("group4:"+matcher1.group(4));
    
                    if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
                    {
                        m_xText.insertString(xTextRange, matcher1.group(1), false);
                        remainingText = remainingText.replaceFirst(matcher1.group(1), "");
                    }
                    if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
                    {
                        switch (matcher1.group(2)) {
                            case "pof": [...]
                            case "pos": [...]
                            case "poif": [...]
                            case "po": [...]
                            case "poi": [...]
                            case "pol": [...]
                            case "poil": [...]
                            case "sm": [...]
                        }
                        remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
                    }
                }
            }

The System.out.println calls produce output only once in my console, with the following result:

    Iteration 1:
      group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>; 
      group2:poil
      group3:po
      group4:for out of man this one has been taken.”
    

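Group 3 can be ignored; the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why does this capture only the last tag instance, "poil", and not the preceding "pof", "poi" and "po" tags?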

The output I would like to see is this:

    Iteration 1:
      group1:the man said:
      group2:pof
      group3:po
      group4:“This one, at last, is bone of my bones
    
    Iteration 2:
      group1:
      group2:poi
      group3:po
      group4:and flesh of my flesh;
    
    Iteration 3:
      group1:
      group2:po
      group3:po
      group4:This one shall be called ‘woman,’
    
    Iteration 4:
      group1:
      group2:poil
      group3:po
      group4:for out of man this one has been taken.”
    

1 Answer:

Answer 0 (score: 1)

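I just found the answer to this question: the first capture group simply needs a non-greedy (reluctant) quantifier, as I already had in the fourth capture group. With the greedy (.*), the first group consumes as much text as it can and backtracks only far enough to leave the last tag for the rest of the pattern, so find() succeeds just once. This works exactly as required: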

Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
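To make the difference concrete, here is a minimal, self-contained sketch that runs the greedy and the reluctant pattern side by side. The class name TagCaptureDemo, the helpers countMatches and capture, and the ASCII-simplified sample string are assumptions made for illustration; they are not part of the original code, and the document-insertion calls (m_xText.insertString) are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the original post.
class TagCaptureDemo {
    // Pattern from the question: greedy (.*) in group 1.
    static final Pattern GREEDY = Pattern.compile(
            "(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Pattern from the answer: reluctant (.*?) in group 1.
    static final Pattern LAZY = Pattern.compile(
            "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Counts how many times find() succeeds for the given pattern.
    public static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns one {leadingText, tagName, tagContent} triple per match,
    // plus a final {trailingText, null, null} entry if plain text remains.
    public static List<String[]> capture(String text) {
        List<String[]> segments = new ArrayList<>();
        Matcher m = LAZY.matcher(text);
        int lastEnd = 0;
        while (m.find()) {
            segments.add(new String[] { m.group(1), m.group(2), m.group(4) });
            lastEnd = m.end(); // remember where the last match stopped
        }
        if (lastEnd < text.length()) {
            segments.add(new String[] { text.substring(lastEnd), null, null });
        }
        return segments;
    }

    public static void main(String[] args) {
        // ASCII-simplified version of the sample text (curly quotes removed).
        String text = "the man said:<pof>This one, at last, is bone of my bones</pof>"
                + "<poi>and flesh of my flesh;</poi>"
                + "<po>This one shall be called 'woman,'</po>"
                + "<poil>for out of man this one has been taken.</poil>";

        System.out.println("greedy matches: " + countMatches(GREEDY, text)); // 1
        System.out.println("lazy matches:   " + countMatches(LAZY, text));   // 4

        for (String[] s : capture(text)) {
            System.out.println("before=[" + s[0] + "] tag=" + s[1] + " content=[" + s[2] + "]");
        }
    }
}
```

Tracking matcher.end() is one possible way to find leftover "normal" text without calling replaceFirst, which interprets its first argument as a regex and can misbehave when the captured text contains characters such as "." or "?". Note also that "|" inside a character class is a literal pipe, so [f|l|s|i|3] accepts the same letters as [flsi3] plus the "|" character itself.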