Question

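I'm trying to build a regular expression to "reduce" repeated consecutive substrings in a string in Java. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is the code I have so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works fine for every repeated substring except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.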

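As far as I can tell, my regex requires a space after every word in the substring, which means it won't catch a repeat that ends in a period instead of a space. I can't seem to find a workaround. I've tried using capture groups and changing the regex to look for a space or a period rather than just a space, but that only works when there is a period after each repeated part of the substring ("nearby.nearby.").

Can anyone point me in the right direction? Ideally, the input to this method will be short paragraphs rather than just single lines.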

Answer 1

You can use

input.replaceAll("([ \\w]+)\\1", "$1");

See the live demo:

import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args)
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

        // A run of spaces and word characters followed immediately by a copy of itself.
        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);

        // Collapsing one repeat can expose another, so repeat until nothing changes.
        String previous;
        do {
            previous = input;
            input = dupPattern.matcher(input).replaceAll("$1");
        } while (!input.equals(previous));

        System.out.println(input);
    }
}
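
One caveat worth checking before running this on real paragraphs: because the character class `[ \w]` also matches single letters, the pattern can collapse a doubled letter inside a word. The small check below is only an illustration; the class name `DoubledLetterCheck` and the helper `collapse` are made up for this example.

```java
class DoubledLetterCheck {
    // Same regex as in the answer above, applied once.
    static String collapse(String s) {
        return s.replaceAll("([ \\w]+)\\1", "$1");
    }

    public static void main(String[] args) {
        // "ll" and "oo" each look like a one-character substring repeated twice.
        System.out.println(collapse("a balloon")); // prints "a balon"
    }
}
```

So the one-liner is fine for the sample sentence, but it probably needs a word-boundary guard for arbitrary text.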

Answer 2

Combining @Thomas Ayoub's answer with @Matt's comment.

public class Test2 {
    public static void main(String[] args){
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // \b makes each match start at a word boundary.
        String result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        // One pass can expose a new repeat ("big big" hides "big black dog" twice), so loop until stable.
        while(!input.equals(result)){
            input = result;
            result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        }
        System.out.println(result);
    }
}
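
If you want the match to be made of whole words only, a possible variant is to capture one or more words and require each repeat to end on a word boundary as well. This is a sketch rather than a drop-in fix: the class name `WordRepeatSketch`, the method `collapseRepeats` and the regex itself are my own assumptions, not something from the answers above. Since `\b` also matches between a letter and a period, the repeat at the end of the sentence is handled too.

```java
import java.util.regex.Pattern;

class WordRepeatSketch {
    // Group 1: one or more whole words separated by whitespace.
    // Then one or more copies of group 1, each preceded by whitespace
    // and ending on a word boundary.
    private static final Pattern REPEATED_WORDS =
        Pattern.compile("\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    static String collapseRepeats(String text) {
        String previous;
        // A single pass can expose a new repeat, so loop until nothing changes.
        do {
            previous = text;
            text = REPEATED_WORDS.matcher(text).replaceAll("$1");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats(
            "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby."));
        // The trailing \b keeps "dog" from matching the start of "doghouse".
        System.out.println(collapseRepeats("a dog doghouse"));
    }
}
```

The nested quantifiers can backtrack heavily on long inputs, so I would only rely on this for short paragraphs like the ones described in the question.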

Java regex to remove duplicate substrings from a string

2 answers: