Question

I want to remove all duplicate words from a file using a regular expression.

For example:

 The university of Hawaii university began using began radio.

Output:

 The university of Hawaii began using radio.

I wrote this regular expression:

 String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";

It only removes a duplicate that directly follows the word.

For example: The university university of Hawaii Hawaii began using radio.

Output: The university of Hawaii began using radio.

My code using the regular expression:

            File dir = new File("C:/Users/Arnoldas/workspace/uplo/");

            String source = dir.getCanonicalPath() + File.separator + "Output.txt";
            String dest = dir.getCanonicalPath() + File.separator + "Final.txt";

            File fin = new File(source);
            FileInputStream fis = new FileInputStream(fin);
            BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

            //FileWriter fstream = new FileWriter(dest, true);
            OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");

            BufferedWriter out = new BufferedWriter(fstream);

            String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";

            //String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
            Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

            String aLine;
            while ((aLine = in.readLine()) != null) {

                Matcher m = p.matcher(aLine);
                while (m.find()) {
                    aLine = aLine.replaceAll(m.group(), m.group(1));
                }

                //Process each line and add output to *.txt file
                out.write(aLine);
                out.newLine();
                out.flush();
            }

Answer 1

You can use Streams instead:

String s = "The university university of Hawaii Hawaii began using radio.";
System.out.println(Arrays.asList(s.split(" ")).stream().distinct().collect(Collectors.joining(" ")));

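In this example the String is split on whitespace and then converted to a stream. Duplicates are removed with distinct(), and finally the remaining words are joined back together.

This approach has a problem at the end of the sentence, though: "radio" and "radio." are treated as different words.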
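
One possible way around the punctuation problem, as a rough sketch rather than part of the original answer: compare tokens by a letters-only, lower-cased key while keeping the original tokens in the output. The class name `DistinctWords` and the helper `dedupe` are made up for this illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class DistinctWords {
    // Hypothetical helper: keeps a token only if its letters-only,
    // lower-cased form has not been seen earlier in the string.
    static String dedupe(String s) {
        Set<String> seen = new HashSet<>();
        StringJoiner out = new StringJoiner(" ");
        for (String token : s.trim().split("\\s+")) {
            // Strip everything that is not a letter, so "radio." and "radio" share a key.
            String key = token.replaceAll("\\P{IsAlphabetic}", "").toLowerCase();
            // Tokens without any letters (e.g. "-") are always kept.
            if (key.isEmpty() || seen.add(key)) {
                out.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("The university of Hawaii university began using began radio."));
    }
}
```

Note the trade-off: when a dropped duplicate carries punctuation (for example a trailing "radio." after an earlier "radio"), that punctuation is dropped with it.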

Answer 2

You are on the right track, but if there can be text between the duplicates, the replacement has to be done in a loop (to handle "began ... began ... began").

String s = "The university of Hawaii university began using began radio.";
for (;;) {
    String t = s.replaceAll("(?i)\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s*\\b\\1\\b",
                            "$1$2");
    if (t.equals(s)) {
        break;
    }
    s = t;
}

For case-insensitive replacement, use (?i).

Because the regex has to backtrack, this is very inefficient.

It is simpler to put all the words in a Set.

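// Set.of(...) (Java 9) does not work here: it throws IllegalArgumentException
// on duplicate elements and returns an immutable set.
Set<String> corpus = new TreeSet<>();
Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));

// split() yields an empty first element when s starts with a non-letter.
corpus.remove("");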
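
A TreeSet returns the words in sorted order, so the original word order is lost. If first-occurrence order matters, a LinkedHashSet is one option. The following is a minimal sketch; the class name `OrderedCorpus` and the method `words` are illustrative, and the comparison is case-sensitive.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class OrderedCorpus {
    // Collects the distinct words of s in order of first occurrence.
    static Set<String> words(String s) {
        Set<String> corpus =
            new LinkedHashSet<>(Arrays.asList(s.split("\\P{IsAlphabetic}+")));
        // An empty element appears when s starts with a non-letter.
        corpus.remove("");
        return corpus;
    }

    public static void main(String[] args) {
        String s = "The university of Hawaii university began using began radio.";
        // Prints: The university of Hawaii began using radio
        System.out.println(String.join(" ", words(s)));
    }
}
```

Punctuation and the original spacing are discarded by this approach, which is why the output has no trailing period.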

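After the comments

Corrections to the original code:
Uses the newer I/O style with Files and Paths, but still no streams
Try-with-resources closes the reader and writer automatically

The regex only has to find words with optional leading whitespace. A Set is used to check for duplicates.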

    Path dir = Paths.get("C:/Users/Arnoldas/workspace/uplo");
    Path source = dir.resolve("Output.txt");
    Path dest = dir.resolve("Final.txt");

    // Group 1: optional leading whitespace, group 2: the word itself.
    String regex = "(\\s*)\\b(\\p{IsAlphabetic}+)\\b";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

    // Files.newBufferedReader/newBufferedWriter use UTF-8 by default.
    try (BufferedReader in = Files.newBufferedReader(source);
            BufferedWriter out = Files.newBufferedWriter(dest)) {
        String line;
        while ((line = in.readLine()) != null) {
            // Words already seen on this line, lower-cased.
            Set<String> words = new HashSet<>();
            Matcher m = p.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                boolean added = words.add(m.group(2).toLowerCase());
                // Keep the first occurrence; drop repeats along with their leading whitespace.
                m.appendReplacement(sb, added ? Matcher.quoteReplacement(m.group()) : "");
            }
            m.appendTail(sb);
            out.write(sb.toString());
            out.newLine();
        }
    }
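
To try the same Set-plus-Matcher logic without touching the file system, the per-line step can be pulled out into a method that works on a plain String. This is only a sketch under that assumption; `LineDeduper` and `dedupeLine` are names invented here.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineDeduper {
    // Group 1: optional leading whitespace, group 2: the word.
    private static final Pattern WORD =
        Pattern.compile("(\\s*)\\b(\\p{IsAlphabetic}+)\\b");

    // Removes every repeated word in the line (case-insensitively),
    // keeping the first occurrence and all non-word text.
    static String dedupeLine(String line) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORD.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            boolean firstTime = seen.add(m.group(2).toLowerCase());
            m.appendReplacement(sb, firstTime ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: The university of Hawaii began using radio.
        System.out.println(dedupeLine("The university of Hawaii university began using began radio."));
    }
}
```

Each word is visited once and looked up in a HashSet, so the work grows roughly linearly with the length of the line, in contrast to the repeated backtracking of the loop-based regex.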

Answer 3

Try this regular expression:

\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.

Source: Regular Expression For Consecutive Duplicate Words
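
As a quick check of what this pattern does and does not cover, here is a small sketch that applies it with String.replaceAll. The class name `ConsecutiveOnly` and the method `collapse` are illustrative only.

```java
public class ConsecutiveOnly {
    // Replaces "word word" with "word" using the pattern from this answer.
    static String collapse(String s) {
        return s.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        // Consecutive duplicates are collapsed.
        System.out.println(collapse("The university university of Hawaii Hawaii began using radio."));
        // Non-consecutive duplicates are left untouched.
        System.out.println(collapse("The university of Hawaii university began using began radio."));
    }
}
```

The second call returns its input unchanged, so this pattern on its own does not solve the non-consecutive case asked about in the question.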

How to remove duplicate words in a file using a regular expression (when the words are not consecutive)?
