Using Matcher.appendReplacement() with multiple regions

Time: 2015-08-26 17:42:45

Tags: java

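The Java Matcher.appendReplacement() method (together with appendTail()) lets me transform a source text into a result text while replacing every occurrence of a pattern. In pseudocode, the algorithm looks like this:

while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

Everything still works if the pattern is searched only within a given region:

call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

The problem arises when, after matching within one region, I want to move the region further along:

call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

The following Java program reproduces the problem: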

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestMatcher {

    public static void main(String[] args) throws Exception {
        String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";
        System.out.println("input  = " + inputText);
        StringBuffer result = new StringBuffer();
        Pattern pattern = Pattern.compile("dog");
        Matcher matcher = pattern.matcher(inputText);

        int startPos = inputText.indexOf("start");
        int endPos = inputText.indexOf("end");
        System.out.println("Setting region to " + startPos + "," + endPos);
        matcher.region(startPos, endPos);
        while (matcher.find()) {
            matcher.appendReplacement(result, "cat");
        }
        System.out.println("Partial result = " + result);

        startPos = inputText.indexOf("start", endPos);
        endPos = inputText.indexOf("end", startPos);
        System.out.println("Setting region to " + startPos + "," + endPos);
        matcher.region(startPos, endPos);
        while (matcher.find()) {
            matcher.appendReplacement(result, "cat");
        }
        matcher.appendTail(result);
        System.out.println("Final result   = " + result);
    }
}

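This does not work because region() resets the matcher, so Matcher.appendReplacement() starts again from the beginning of the text, and the result contains some parts of the source twice.

This happens by design, as the Javadoc says.

What is the correct way to replace a pattern that may occur in several regions?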

Edit: added a Java example, removed the text example.

The Java example above shows that from an input like

dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5

you do not get the expected output

dog1 start cat2a cat2b end dog3 start cat4a cat4b end dog5

Output:

input  = dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5
Setting region to 5,23
Partial result = dog1 start cat2a cat
Setting region to 32,50
Final result   = dog1 start cat2a catdog1 start dog2a dog2b end dog3 start cat4a cat4b end dog5

1 answer:

Answer 0 (score: 1)

Perhaps the sub-regions should be processed by a separate matcher? Something like:

public static void main(String[] args) {
  String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";

  System.out.println("Input          = " + inputText);
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("(start(.*?)end)");

  Matcher matcher = pattern.matcher(inputText);

  while (matcher.find()) {
    int s = matcher.start();
    int e = matcher.end();
    System.out.printf("(%d .. %d) -> \"%s\"\n", s, e, matcher.group(1));
    // quoteReplacement keeps any '$' or '\' in the processed text literal
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processSubGroup(matcher.group(1), matcher.group(2))));
  }
  matcher.appendTail(result);
  System.out.println("Final result   = " + result);
}

static String processSubGroup(String subGroup, String contents) {
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("dog");

  Matcher matcher = pattern.matcher(subGroup);

  while (matcher.find())
    matcher.appendReplacement(result, "cat");

  matcher.appendTail(result);
  return result.toString();
}

Or, without the logging, more simply:

public static void main(String[] args) {
  String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";

  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("(start(.*?)end)");

  Matcher matcher = pattern.matcher(inputText);

  while (matcher.find())
    // quoteReplacement keeps any '$' or '\' in the processed text literal
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processSubGroup(matcher.group(1), matcher.group(2))));

  matcher.appendTail(result);
  System.out.println("Final result   = " + result);
}

static String processSubGroup(String subGroup, String contents) {
  return Pattern.compile("dog").matcher(subGroup).replaceAll("cat");
}

Result (from the first version, with logging):

Input          = dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5
(5 .. 26) -> "start dog2a dog2b end"
(32 .. 53) -> "start dog4a dog4b end"
Final result   = dog1 start cat2a cat2b end dog3 start cat4a cat4b end dog5

Or a more abstract approach:

interface GroupProcessor {
  String process(String group);
}

public static void main(String[] args) {
  String inputText = "dog1 dogs dog2a dog2b enddogs cow1 dog3 cows cow2a cow2b endcows dog4 dogs dog5a dog5b enddogs cow3";

  String result = inputText;

  result = processGroup(result, "dogs*enddogs", (group) -> {
    return Pattern.compile("dog").matcher(group).replaceAll("cat");
  });

  result = processGroup(result, "cows*endcows", (group) -> {
    return Pattern.compile("cow").matcher(group).replaceAll("sheep");
  });

  System.out.println("Input        = " + inputText);
  System.out.println("Final result = " + result);
}

static String processGroup(String input, String regex, GroupProcessor processor) {
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile(String.format("(%s)", regex.replace("*", "(.*?)")));

  Matcher matcher = pattern.matcher(input);

  while (matcher.find())
    // quoteReplacement keeps any '$' or '\' in the processed text literal
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processor.process(matcher.group(1))));

  matcher.appendTail(result);
  return result.toString();
}

Which gives us:

Input        = dog1 dogs dog2a dog2b enddogs cow1 dog3 cows cow2a cow2b endcows dog4 dogs dog5a dog5b enddogs cow3
Final result = dog1 cats cat2a cat2b endcats cow1 dog3 sheeps sheep2a sheep2b endsheeps dog4 cats cat5a cat5b endcats cow3
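
On Java 9 or later, the same nested idea can probably be written more compactly with Matcher.replaceAll(Function). The sketch below is only an illustration under that assumption: the class name NestedReplaceSketch and the helper replaceInside are made up for this example and are not part of the answer's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedReplaceSketch {

    // Replaces every match of `inner` with `replacement`, but only inside
    // the spans matched by `outer`. Text outside those spans is left as-is.
    // Hypothetical helper for illustration; requires Java 9+.
    static String replaceInside(String input, Pattern outer, Pattern inner, String replacement) {
        return outer.matcher(input).replaceAll(span ->
                // quoteReplacement keeps '$' and '\' in the processed span literal
                Matcher.quoteReplacement(inner.matcher(span.group()).replaceAll(replacement)));
    }

    public static void main(String[] args) {
        String input = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";
        System.out.println(replaceInside(input,
                Pattern.compile("start.*?end"), Pattern.compile("dog"), "cat"));
    }
}
```

The outer matcher only ever moves forward, so no region() call (and no reset) is involved; the inner replacement works on the matched span as an independent string.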

UPD.

The reason why Matcher.region() resets the implicit matcher state, and with it lastAppendPosition:

appendReplacement/appendTail are, in a sense, a forward-only mechanism, whereas .region() is not that deterministic.
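
To make the "forward-only" point concrete, here is a rough hand-written equivalent of a find()/appendReplacement()/appendTail() loop for a literal replacement. It is a sketch, not the JDK implementation: the class ForwardAppendSketch, the method replaceAllManually and the local variable appendPos are names assumed for this example, with appendPos standing in for the matcher's internal append position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ForwardAppendSketch {

    // Hand-written stand-in for find()/appendReplacement()/appendTail()
    // with a literal replacement (no $-group references are interpreted).
    static String replaceAllManually(String input, Pattern pattern, String literal) {
        StringBuilder out = new StringBuilder();
        Matcher m = pattern.matcher(input);
        int appendPos = 0; // index of the first input char not yet copied to `out`
        while (m.find()) {
            out.append(input, appendPos, m.start()); // text between previous match and this one
            out.append(literal);                     // the replacement itself
            appendPos = m.end();                     // only ever moves forward
        }
        out.append(input, appendPos, input.length()); // what appendTail() would copy
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllManually("dog1 dog2", Pattern.compile("dog"), "cat"));
    }
}
```

Each step copies everything from the append position up to the current match, so a match that lies before the append position cannot be expressed without rewriting what is already in the output.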

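Consider the following situation: for a 100-character string, you apply the region 0..20, run the find()/appendReplacement() loop, then move the region to, for example, 30..60 and run the replacement loop again.

Now you have the 0..100 source string and, for example, a result string covering 0..60 of the source in the StringBuffer.

Next, you apply the region 10..40 to the source string... and what then? If that region of the source contains no match, fine, nothing happens. But what if it does contain one? Where should appendReplacement append or insert the replacement result? The result string has already moved past the 10..40 region, and appendReplacement appends; it does not replace a section of the string in the output buffer.

If there were some constraint mechanism that limited the region to MAX(start, lastAppendPosition)..MIN(end, sourceLength), then fine, the append mechanism would work correctly. But the .region() method has no such restriction, and such restrictions would make .region() useless for searching (which is the main purpose of the .region() method).

That is why .region() resets the implicit matcher state, which makes it not very useful together with appendReplacement(). If you need different behavior, extend the Matcher class by encapsulation (Matcher is final, so it has to be wrapped rather than subclassed).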
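
As a rough illustration of the encapsulation idea, the sketch below wraps a Matcher and keeps its own append position, so that position survives region() calls. Everything here is an assumption made for the example: the class RegionReplacer and its methods replaceInRegion and finish are hypothetical, the replacement is treated as a literal string, and regions must be supplied left to right (a region that starts before the current append position is rejected rather than clamped).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper, not part of java.util.regex.
final class RegionReplacer {
    private final String input;
    private final Matcher matcher;
    private final StringBuilder out = new StringBuilder();
    private int appendPos = 0; // kept here, so region() cannot reset it

    RegionReplacer(Pattern pattern, String input) {
        this.input = input;
        this.matcher = pattern.matcher(input);
    }

    // Replaces all matches inside [start, end) with a literal string.
    RegionReplacer replaceInRegion(int start, int end, String literal) {
        if (start < appendPos) {
            throw new IllegalArgumentException(
                    "region starts before append position " + appendPos);
        }
        matcher.region(start, end); // resets the matcher, but not our appendPos
        while (matcher.find()) {
            out.append(input, appendPos, matcher.start());
            out.append(literal);
            appendPos = matcher.end();
        }
        return this;
    }

    // Returns the replaced text followed by the not yet copied rest of the input.
    String finish() {
        return out.toString() + input.substring(appendPos);
    }

    public static void main(String[] args) {
        String input = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";
        RegionReplacer replacer = new RegionReplacer(Pattern.compile("dog"), input);

        // Same two regions as in the question: from each "start" to the next "end".
        int startPos = input.indexOf("start");
        int endPos = input.indexOf("end");
        replacer.replaceInRegion(startPos, endPos, "cat");

        startPos = input.indexOf("start", endPos);
        endPos = input.indexOf("end", startPos);
        replacer.replaceInRegion(startPos, endPos, "cat");

        System.out.println("Final result   = " + replacer.finish());
    }
}
```

Rejecting backward regions is a deliberate simplification; clamping the start to the append position, as discussed above, would be another option with different trade-offs.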