I am parsing a PDF with PDFBox and putting its content into an ArrayList, and then I need to remove the duplicate words. This is what I tried.
List <String> ContentList = new ArrayList<String>();
List<String> noRepeat = new ArrayList<String>();
ContentList.add(indexed.content);
for(String s : ContentList)
{
String result = s.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
noRepeat.add(result);
}
System.out.println(noRepeat);
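A note on this first attempt: the pattern only matches a word that is immediately followed by repeats of itself, so repeats separated by other words are left alone. A minimal sketch that shows the difference. The class and method names here are made up for the illustration and are not part of the original code.

```java
public class AdjacentDuplicateDemo {
    // The pattern from the attempt above: a word followed by one or more
    // whitespace-separated repeats of itself, matched case-insensitively.
    static final String ADJACENT = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";

    // Hypothetical helper: collapses runs of the same adjacent word to one word.
    static String collapseAdjacent(String text) {
        return text.replaceAll(ADJACENT, "$1");
    }

    public static void main(String[] args) {
        // Adjacent repeat: collapsed to a single word.
        System.out.println(collapseAdjacent("Learning learning Agenda"));
        // Non-adjacent repeat of "Learning": the text comes back unchanged.
        System.out.println(collapseAdjacent("Supervised Learning Unsupervised Learning"));
    }
}
```

Since the repeats in the PDF text are spread across lines, a Set-based approach is probably a better fit than this regex.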
In the code below I did not use an ArrayList.
String duplicatePattern = "(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b";
Pattern pp = Pattern.compile(duplicatePattern);
Matcher m = pp.matcher(indexed.content);
while (m.find()) {
System.out.println(m.group(1));
}
A small part of the content:
Supervised and Unsupervised
Learning
Agenda
● Introduction
● Supervised Learning
● Unsupervised Learning
What is ML ?
● Field of study that gives computers the
ability to learn without being explicitly
programmed
Uniformity of cell size
Uniformity of cell shape
The code should keep only one Supervised, one Learning, and one Uniformity, rather than every occurrence of Supervised, Learning, Uniformity, and so on.
Update
I wrote this and it works.
Set<String> indexedContentSet = new HashSet<>();
String[] words = indexed.content.split("\\s+");
Set<String> set = new HashSet<>();
for(String word : words)
{
if(!set.add(word))
{
indexedContentSet.add(word);
}
}
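set contains every word, while indexedContentSet contains only the repeated words.
Can I also compare set and indexedContentSet, and remove from set the words that are in indexedContentSet?
I tried this and it did not work.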
if (set.contains(indexedContentSet)) {
set.remove(indexedContentSet);
}
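One likely reason the snippet above has no effect: set.contains(indexedContentSet) asks whether the Set object itself is an element of set, which is not the case for a Set<String> of words, so the remove call is never reached. removeAll works element by element instead. A small sketch, assuming the goal is to drop every repeated word from set. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class RemoveRepeatedDemo {
    // Hypothetical helper: returns the words that occur exactly once,
    // i.e. all words minus the ones that were seen more than once.
    static Set<String> wordsOccurringOnce(String content) {
        Set<String> set = new HashSet<>();               // every distinct word
        Set<String> indexedContentSet = new HashSet<>(); // words seen more than once
        for (String word : content.trim().split("\\s+")) {
            if (!set.add(word)) {
                indexedContentSet.add(word);
            }
        }
        // Removes each element of indexedContentSet from set.
        set.removeAll(indexedContentSet);
        return set;
    }

    public static void main(String[] args) {
        System.out.println(wordsOccurringOnce(
                "Supervised Learning Unsupervised Learning Agenda"));
    }
}
```

No contains check is needed first, since removeAll simply ignores elements that are not present.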
How do I remove short words from the Set?
Before the program looks for repeated words, I put indexed.content.replaceAll("\\b\\w{1,4}\\b\\s?", ""); above Set<String> indexedContentSet = new HashSet<>(); but it does not work.
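A probable cause: Java strings are immutable, so replaceAll returns a new String and leaves indexed.content as it was unless the result is assigned to something. A minimal sketch of the difference, using a plain local String in place of indexed.content. The class and method names are hypothetical.

```java
public class ShortWordFilterDemo {
    // Hypothetical helper: returns a copy of content without words of 1 to 4 characters.
    static String dropShortWords(String content) {
        // replaceAll does not modify content; it returns a new String.
        return content.replaceAll("\\b\\w{1,4}\\b\\s?", "");
    }

    public static void main(String[] args) {
        String content = "Uniformity of cell size";

        // Result discarded: content still holds the original text afterwards.
        content.replaceAll("\\b\\w{1,4}\\b\\s?", "");
        System.out.println(content);

        // Result assigned: the short words are gone (a trailing space may remain).
        content = dropShortWords(content);
        System.out.println(content.trim());
    }
}
```

Filtering by word length after splitting, rather than editing the text with a regex, avoids the leftover whitespace.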
Answer 0 (score: 1)
You should use a Set, since sets are designed to contain only distinct elements.
Set<String> uniqueWords = new HashSet<>();
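uniqueWords.addAll(Arrays.asList(words)); // assuming words is the String[] from the question; addAll needs a Collection
To remove short elements, you can filter the set. Using Java 8: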
uniqueWords.stream().filter(word -> word.length() > 4).collect(Collectors.toSet());
// returns a new Set that contains the words of uniqueWords of 5 or more characters
Using Java < 8:
Iterator<String> wordsIt = uniqueWords.iterator();
while (wordsIt.hasNext()) {
if (wordsIt.next().length() < 5) { wordsIt.remove(); }
}
// at this point the uniqueWords Set only contains words of 5 or more characters
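Putting the two steps together, here is one way the whole thing could look as a single method. This is only a sketch: it collects into a TreeSet rather than a HashSet so the printed order is predictable, and the class name, method name, and minLength parameter are made up for the example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class UniqueLongWordsDemo {
    // Hypothetical helper: distinct words of at least minLength characters.
    static Set<String> uniqueLongWords(String content, int minLength) {
        return Arrays.stream(content.trim().split("\\s+"))
                .filter(word -> word.length() >= minLength)
                // A Set drops duplicates; TreeSet also keeps the words sorted.
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String content = "Uniformity of cell size\nUniformity of cell shape";
        System.out.println(uniqueLongWords(content, 5));
    }
}
```

Words are compared exactly as written here, so "Learning" and "learning" would count as two different words.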
Answer 1 (score: 1)
It took some time, because I had to recreate everything.
Scanner x = new Scanner(System.in); // Assumption: x is not declared in the original; a java.util.Scanner is the likely intent
Set<String> indexedContentSet = new HashSet<>(); // Contains only the words that occur twice or more
Set<String> set = new HashSet<>(); // Contains every unique word
String tmp; // Holds the current line of user input
for (int i = 0; i < 12; i++) { // There are 12 input lines
tmp = x.nextLine(); // Read one line
String[] arr = tmp.split("\\s+"); // Split on whitespace
for (String y : arr) { // For each word in the line
if (y.length() > 3) { // Only consider words longer than 3 characters
if (set.contains(y)) { // Already in the unique set, so it is a duplicate
indexedContentSet.add(y); // To store words in lowercase only, use indexedContentSet.add(y.toLowerCase());
}
set.add(y); // Add every word to set; to store words in lowercase only, use set.add(y.toLowerCase());
}
}
}
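The same idea as a self-contained sketch that reads from a String instead of the console, so the line count does not have to be hard-coded. It lowercases every word before comparing, which is an assumption about what is wanted. The class name, method name, and minLength parameter are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

public class RepeatedWordsDemo {
    // Hypothetical helper: words of at least minLength characters that occur
    // more than once, compared case-insensitively and returned in lowercase.
    static Set<String> repeatedWords(String content, int minLength) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new HashSet<>();
        try (Scanner lines = new Scanner(content)) {
            while (lines.hasNextLine()) {
                for (String word : lines.nextLine().trim().split("\\s+")) {
                    String key = word.toLowerCase(Locale.ROOT);
                    // add returns false when the word was already seen.
                    if (key.length() >= minLength && !seen.add(key)) {
                        repeated.add(key);
                    }
                }
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        String content = "Supervised and Unsupervised\nLearning\n"
                + "Supervised Learning\nUnsupervised Learning";
        // minLength 4 corresponds to the y.length() > 3 check above.
        System.out.println(repeatedWords(content, 4));
    }
}
```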