I want to use a regular expression to remove all duplicate words from a file.
For example:
The university of Hawaii university began using began radio.
Output:
The university of Hawaii began using radio.
I wrote this regex:
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
It only removes a duplicate that immediately follows the word.
For example:
The university university of Hawaii Hawaii began using radio.
Output: The university of Hawaii began using radio.
My code using the regex:
File dir = new File("C:/Users/Arnoldas/workspace/uplo/");
String source = dir.getCanonicalPath() + File.separator + "Output.txt";
String dest = dir.getCanonicalPath() + File.separator + "Final.txt";
File fin = new File(source);
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
//FileWriter fstream = new FileWriter(dest, true);
OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");
BufferedWriter out = new BufferedWriter(fstream);
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
//String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String aLine;
while ((aLine = in.readLine()) != null) {
Matcher m = p.matcher(aLine);
while (m.find()) {
aLine = aLine.replaceAll(m.group(), m.group(1));
}
//Process each line and add output to *.txt file
out.write(aLine);
out.newLine();
out.flush();
}
Answer 0 (score: 0)
You can use Streams instead:
String s = "The university university of Hawaii Hawaii began using radio.";
System.out.println(Arrays.asList(s.split(" ")).stream().distinct().collect(Collectors.joining(" ")));
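In this example, the String is split on spaces and then converted to a stream. distinct() removes the duplicates, and finally the remaining words are joined back together.
This approach has a problem at the end of the sentence, though: "radio" and "radio." count as different words.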
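One way around the punctuation problem is to compare tokens by their letters only. The following is a minimal sketch, not part of the original answer: the class name DistinctWords and the helper dedupe are hypothetical, and it assumes that case and attached punctuation should be ignored when deciding whether a word was already seen.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class DistinctWords {
    // Hypothetical helper: keeps the first occurrence of each word,
    // comparing words by their letters only (case-insensitive).
    static String dedupe(String s) {
        Set<String> seen = new HashSet<>();
        return Arrays.stream(s.split("\\s+"))
                .filter(token -> {
                    // Strip everything that is not a letter to build the comparison key.
                    String key = token.replaceAll("\\P{IsAlphabetic}", "").toLowerCase();
                    // Tokens without letters are always kept; others only on first sight.
                    return key.isEmpty() || seen.add(key);
                })
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "The university university of Hawaii Hawaii began using radio.";
        System.out.println(dedupe(s));
        // prints: The university of Hawaii began using radio.
    }
}
```

The filter relies on a stateful lambda, which is only safe on a sequential stream like this one. Note also that dropping a duplicate token drops any punctuation attached to it, so "radio began radio." becomes "radio began".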
Answer 1 (score: 0)
You are on the right track, but if there may be text between the duplicates, it has to be done in a loop (for "began ... began ... began").
String s = "The university of Hawaii university began using began radio.";
for (;;) {
String t = s.replaceAll("(?i)\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s*\\b\\1\\b",
"$1$2");
if (t.equals(s)) {
break;
}
s = t;
}
For a case-insensitive replacement, use (?i).
Because the regex has to backtrack, this is very inefficient.
It is simpler to put all the words in a Set.
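// Note: Set.of(...) (Java 9+) throws IllegalArgumentException when given
// duplicate elements, so it cannot be used here; fill a mutable set instead.
Set<String> corpus = new TreeSet<>();
Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));
corpus.remove(""); // split yields an empty first token if the text starts with a non-letter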
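A TreeSet keeps its elements sorted, so the original word order is lost. If the first-seen order matters, a LinkedHashSet can be swapped in. This is a rough sketch under that assumption; the class name UniqueWords and the method uniqueWords are made up for illustration, and the comparison here is case-sensitive.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Hypothetical helper: collects the distinct words in first-seen order.
    static Set<String> uniqueWords(String s) {
        Set<String> corpus = new LinkedHashSet<>();
        // Split on runs of non-letters, so punctuation never ends up in a word.
        Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));
        corpus.remove(""); // possible empty first token
        return corpus;
    }

    public static void main(String[] args) {
        String s = "The university of Hawaii university began using began radio.";
        System.out.println(String.join(" ", uniqueWords(s)));
        // prints: The university of Hawaii began using radio
    }
}
```

Punctuation is discarded by the split, which is why the trailing period is missing from the output.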
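After the comments:
The regex only needs to find each word together with its optional leading whitespace. Use a Set to check for duplicates.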
Path dir = Paths.get("C:/Users/Arnoldas/workspace/uplo");
Path source = dir.resolve("Output.txt");
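Path dest = dir.resolve("Final.txt");
// Group 1 is the whitespace before the word, group 2 the word itself,
// so dropping a duplicate also drops its leading space.
String regex = "(\\s*)\\b(\\p{IsAlphabetic}+)\\b";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
try (BufferedReader in = Files.newBufferedReader(source);
BufferedWriter out = Files.newBufferedWriter(dest)) {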
String line;
while ((line = in.readLine()) != null) {
Set<String> words = new HashSet<>();
Matcher m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while (m.find()) {
boolean added = words.add(m.group(2).toLowerCase());
m.appendReplacement(sb, added ? m.group() : "");
}
m.appendTail(sb);
out.write(sb.toString());
out.newLine();
}
}
Answer 2 (score: 0)
Try this regex:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
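To see what this pattern does and does not cover, here is a small sketch using String.replaceAll. The class and method names are hypothetical. The second method shows a variant with a repeated group, assuming runs of three or more identical words should collapse to one; neither pattern handles duplicates that are separated by other words.

```java
class AdjacentDuplicates {
    // Collapses each pair of adjacent identical words into one word.
    static String collapsePairs(String s) {
        return s.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    // Variant: the repeated group also collapses runs of three or more.
    static String collapseRuns(String s) {
        return s.replaceAll("\\b(\\w+)(?:\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("The university university of Hawaii Hawaii began using radio."));
        // prints: The university of Hawaii began using radio.
        System.out.println(collapsePairs("began began began"));
        // prints: began began
        System.out.println(collapseRuns("began began began"));
        // prints: began
    }
}
```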