Remove duplicate words from the content of a PDF

Asked: 2016-04-14 19:16:07

Tags: java regex

I am parsing a PDF with PDFBox and putting its content into an ArrayList; then I need to remove the duplicate words. This is what I tried.

    List<String> ContentList = new ArrayList<String>();
    List<String> noRepeat = new ArrayList<String>();
    ContentList.add(indexed.content);
    for (String s : ContentList)
    {
        String result = s.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
        noRepeat.add(result);
    }
    System.out.println(noRepeat);

In the code below I did not use an ArrayList.

String duplicatePattern = "(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b";
Pattern pp = Pattern.compile(duplicatePattern);
Matcher m = pp.matcher(indexed.content);
while (m.find()) {
    System.out.println(m.group(1));
}

A small part of the content:

Supervised and Unsupervised 
Learning
Agenda
● Introduction
● Supervised Learning
● Unsupervised Learning
What is ML ?
● Field of study that gives computers the 
ability to learn without being explicitly 
programmed
Uniformity of cell size
Uniformity of cell shape

The code should keep only one Supervised, one Learning, and one Uniformity, rather than every occurrence of Supervised, Learning, Uniformity, and so on.

Update

I wrote this and it works.

Set<String> indexedContentSet = new HashSet<>();

String[] words = indexed.content.split("\\s+");

Set<String> set = new HashSet<>();
for (String word : words)
{
    if (!set.add(word))
    {
        indexedContentSet.add(word);
    }
}

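set holds every word, and indexedContentSet holds only the repeated words. Can I also compare set with indexedContentSet and remove from set the words that appear in indexedContentSet?

I tried this and it did not work.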

if (set.contains(indexedContentSet)) {
    set.remove(indexedContentSet);
}

How do I remove short words from the Set? Before the program looks for repeated words, I put indexed.content.replaceAll("\\b\\w{1,4}\\b\\s?", ""); above Set<String> indexedContentSet = new HashSet<>(); but it did not work.

2 Answers:

Answer 0: (score: 1)

You should use a Set, since sets are designed to contain distinct elements.

Set<String> uniqueWords = new HashSet<>();
uniqueWords.addAll(Arrays.asList(words)); // words is the String[] from split(); addAll needs a Collection
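
As a fuller illustration of the Set approach, here is a minimal sketch that splits the text on whitespace and collects the distinct words. It makes two assumptions that are not in the question's code: it uses a `LinkedHashSet` so the words keep their first-seen order, and it lowercases each word so that the comparison ignores case, as the `(?i)` flag in the question's regex suggests. The names `UniqueWordsSketch` and `uniqueWords` are made up for this example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class UniqueWordsSketch {

    // Returns the distinct words of content, lowercased, in first-seen order.
    static Set<String> uniqueWords(String content) {
        Set<String> unique = new LinkedHashSet<>();
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) { // split() yields one empty string for blank input
                unique.add(word.toLowerCase(Locale.ROOT)); // assumption: ignore case
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        String content = "Supervised and Unsupervised\nLearning\nSupervised Learning\nUnsupervised Learning";
        System.out.println(uniqueWords(content));
    }
}
```

If the original order does not matter, a plain `HashSet` would do the same job.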

To remove the short elements, you can filter the set.

With Java 8:

uniqueWords.stream().filter(word -> word.length() > 4).collect(Collectors.toSet());
// returns a new Set that contains the words of uniqueWords of 5 or more characters

With Java < 8:

Iterator<String> wordsIt = uniqueWords.iterator();
while (wordsIt.hasNext()) {
  if (wordsIt.next().length() < 5) { wordsIt.remove(); }
}
// at this point the uniqueWords Set only contains words of 5 or more characters

Here is a demo: https://ideone.com/vRZu1Z
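
On the set comparison asked about in the question: `set.contains(indexedContentSet)` asks whether the set object itself is an element of `set`, which is never true for a `Set<String>`, so that attempt removes nothing. The bulk operation for a set difference is `removeAll`. A minimal sketch, assuming the goal is to keep only the words that are not repeated; the names `SetDifferenceSketch` and `onlyOnce` are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetDifferenceSketch {

    // Returns the words of allWords that are not in repeatedWords.
    static Set<String> onlyOnce(Set<String> allWords, Set<String> repeatedWords) {
        Set<String> result = new HashSet<>(allWords); // copy, so the inputs stay untouched
        result.removeAll(repeatedWords);              // set difference
        return result;
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>(Arrays.asList("Supervised", "Learning", "Agenda", "Introduction"));
        Set<String> indexedContentSet = new HashSet<>(Arrays.asList("Supervised", "Learning"));

        // false: no String element is equal to a Set
        System.out.println(set.contains(indexedContentSet));

        // Contains Agenda and Introduction (HashSet iteration order is unspecified)
        System.out.println(onlyOnce(set, indexedContentSet));
    }
}
```

Calling `set.removeAll(indexedContentSet)` directly also works if modifying `set` in place is acceptable.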

Answer 1: (score: 1)

This took some time, because I had to recreate everything.

Set<String> indexedContentSet = new HashSet<>(); // Only the words that occur two or more times
Set<String> set = new HashSet<>(); // All unique words

Scanner x = new Scanner(System.in); // Assumption: x is a Scanner over the input (its declaration was not shown)
String tmp; // Holds the current input line

for (int i = 0; i < 12; i++) { // There are 12 lines of input
    tmp = x.nextLine(); // Read each line
    String[] arr = tmp.split("\\s+"); // Split on whitespace

    for (String y : arr) { // For each word in the line
        if (y.length() > 3) { // Only consider words longer than 3 characters
            if (set.contains(y)) { // Already seen, so it is a duplicate
                indexedContentSet.add(y); // To ignore case, lowercase y once before the contains check and use that value for both sets
            }
            set.add(y); // Add every word to set (which ends up holding each word once)
        }
    }
}

Ideone Demo
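
On the `replaceAll` attempt in the question: Java strings are immutable, so `replaceAll` returns a new string and leaves `indexed.content` unchanged. Calling it without using the return value therefore has no visible effect. A minimal sketch of the difference, reusing the question's regex; the names `ShortWordFilterSketch` and `dropShortWords` are made up for this example.

```java
class ShortWordFilterSketch {

    // Returns a copy of content without words of 1 to 4 word characters.
    static String dropShortWords(String content) {
        return content.replaceAll("\\b\\w{1,4}\\b\\s?", "");
    }

    public static void main(String[] args) {
        String content = "Uniformity of cell size";

        content.replaceAll("\\b\\w{1,4}\\b\\s?", ""); // result discarded, content is unchanged
        System.out.println(content);

        String filtered = dropShortWords(content);    // keep the returned string instead
        System.out.println(filtered);
    }
}
```

In the question's code this would mean splitting the returned string, for example `String[] words = dropShortWords(indexed.content).split("\\s+");`.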