Question

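I have recently been working on a problem: removing duplicated words from a string, i.e. "i am am good" becomes "i am good". But I noticed something strange. The regex works in all cases except one, and I can't understand why.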

Here is my code:

        String regex = "\\b(\\w+)(\\s+\\1\\b)+";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        String input = "INPUT";

        Matcher m = p.matcher(input);

        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            input = input.replaceAll(m.group(), m.group(1));
        }

        // Prints the modified sentence.
        System.out.println(input);

Now, when INPUT is:

i am am 2 am am am 1 am am a good man

Output:

i am 2 am am 1 am a good man

There are still two duplicate "am"s. Now, if INPUT is:

i am am 2 am am 1 am am am a good man

Output:

i am 2 am 1 am a good man

No duplicate "am".

I have no idea why this happens. Can anyone help?

Answer 1

You're overthinking it.

All of your code can be replaced with:

System.out.println(input.replaceAll("(?i)\\b(\\w+)(\\s+\\1\\b)+", "$1"));

This replaces each match with the text of capture group 1.
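As a minimal, self-contained sketch of that one-liner: the class name `DedupOneLiner` and the helper `dedupe` are made up for illustration, and the sample sentences are assumed inputs rather than anything required by the approach.

```java
final class DedupOneLiner {
    // Collapses each run of a repeated word into a single occurrence.
    // (?i) makes the back-reference \1 match case-insensitively.
    static String dedupe(String input) {
        return input.replaceAll("(?i)\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("i am am 2 am am am 1 am am a good man"));
        System.out.println(dedupe("Hello hello world"));
    }
}
```

Because the whole run (`am am am`) is one match and the replacement is only group 1, a single pass should be enough; no loop is needed.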

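That is the best solution regardless. But since you seem to want an explanation of why your code fails, here it is:

If you had debugged your code, the reason for the failure would have been obvious.

Adding 3 print statements to your code shows the problem: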

while (m.find()) {
    System.out.printf("group() = \"%s\", group(1) = \"%s\"%n", m.group(), m.group(1));
    System.out.printf("  input (before) = \"%s\"%n", input);
    input = input.replaceAll(m.group(), m.group(1));
    System.out.printf("  input (after) = \"%s\"%n", input);
}

Output

group() = "am am", group(1) = "am"
  input (before) = "i am am 2 am am am 1 am a good man"
  input (after) = "i am 2 am am 1 am a good man"
group() = "am am am", group(1) = "am"
  input (before) = "i am 2 am am 1 am a good man"
  input (after) = "i am 2 am am 1 am a good man"

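As you can see, the problem is that the second match is still found in the original input, where it matches am am am, but the first call to replaceAll() has already removed one of those am's from input. The text am am am no longer occurs in input, so the second replaceAll() call changes nothing.

One way to fix the code (keeping it as close to your own as possible) is to call replaceFirst() instead of replaceAll(). You should also quote the values, because both methods treat the first argument as a regular expression and give $ and \ a special meaning in the replacement.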

while (m.find()) {
    input = input.replaceFirst(Pattern.quote(m.group()), Matcher.quoteReplacement(m.group(1)));
}
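To illustrate why the quoting matters in general, here is a small sketch. The class `QuotingDemo`, the helper `replaceFirstLiteral`, and the sample strings are hypothetical. With `\w+` and `\s+` the matched text in this question is unlikely to contain regex metacharacters, so treat this as a demonstration of the general pitfall rather than of a bug in the loop above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotingDemo {
    // Replaces the first literal occurrence of target with the literal replacement.
    static String replaceFirstLiteral(String text, String target, String replacement) {
        return text.replaceFirst(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Unquoted: "1+1" is read as the regex "one or more 1s, then a 1",
        // which needs at least "11", so nothing in "1+1=2" matches.
        System.out.println("1+1=2".replaceFirst("1+1", "two"));
        // Quoted: the target is matched character for character.
        System.out.println(replaceFirstLiteral("1+1=2", "1+1", "two"));
        // Quoted replacement: "$1" is inserted as text, not read as a group reference.
        System.out.println(replaceFirstLiteral("price: X", "X", "$1"));
    }
}
```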

Answer 2

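It doesn't work that way.

For one thing, when you do Matcher m = p.matcher(input);, the Matcher is applied to the object that input refers to, which is an immutable String.

You might think that you are changing it when you reassign it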

input = input.replaceAll(m.group(), m.group(1));

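But no, you are only making the input variable refer to a new String. The matcher keeps working on the old string.

To test this, add a debug line and replace with an altered string: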

   while (m.find()) {
      System.out.println("input=[" + input +"] group=[" + m.group()  +"] group(1)=["+m.group(1)+"]");
       input = input.replaceAll(m.group(), m.group(1) + "x");
   }

This produces:

input=[i am am 2 am am am 1 am am a good man] group=[am am] group(1)=[am]
input=[i amx 2 amx am 1 amx a good man] group=[am am am] group(1)=[am]
input=[i amx 2 amx am 1 amx a good man] group=[am am] group(1)=[am]
i amx 2 amx am 1 amx a good man

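Even though the input variable (after the first loop) no longer contains any "am am" substrings, the matcher still finds them.

To fix it in the spirit of your approach (there may be more elegant or more efficient ways):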

   while( true ) {
      Matcher m = p.matcher(input);
      if(!m.find()) break;
      input = input.replaceAll(m.group(), m.group(1) );
   }

Or, a little simpler:

   while( true ) {
      String modif = input.replaceAll("\\b(\\w+)(\\s+\\1\\b)", "$1");
      if(modif.equals(input)) break;
      input = modif;
   }
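One candidate for a position-based variant of the find loop is Matcher.appendReplacement() with appendTail(), which builds the result from the match positions instead of searching the changing string again. This is only a sketch under assumptions: the class name `DedupAppendReplacement`, the constant `DUPLICATES`, and the method `dedupe` are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DedupAppendReplacement {
    private static final Pattern DUPLICATES =
            Pattern.compile("\\b(\\w+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    static String dedupe(String input) {
        Matcher m = DUPLICATES.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the text since the previous match, then the first word of this run.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1)));
        }
        // Copies whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("i am am 2 am am am 1 am am a good man"));
    }
}
```

Since the matcher only ever reads the original string and the output goes to a separate buffer, reassigning input inside the loop is no longer an issue.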

Java regex issue for finding duplicate words "\\b(\\w+)(\\s+\\1\\b)+"
