Question

I wrote a simple Java program that uses PDFBox to extract words from a PDF file. It reads the text from the PDF and is meant to extract it word by word.

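import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // Split the extracted text into lines (handles both \n and \r\n)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}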

Is there a way to extract the words without duplicates?

Answer 1

Split each line on spaces with line.split(" ").
Maintain a HashSet to hold the words and keep adding every word to it.

A HashSet inherently ignores duplicates.

HashSet<String> uniqueWords = new HashSet<>();

for (String line : lines) {
  String[] words = line.split(" ");

  for (String word : words) {
    uniqueWords.add(word);
  }
}
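
A slightly more defensive variant of the same idea, as a minimal sketch: splitting on a single space leaves empty strings when a line contains consecutive spaces, and "The" and "the," would count as different words. The class name UniqueWords and the helper uniqueWords are hypothetical names chosen for illustration, and lowercasing plus stripping surrounding punctuation is an assumption about what "duplicate" should mean here.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class UniqueWords {

    // Hypothetical helper (not part of PDFBox): collects the distinct words
    // found in the given lines, keeping first-seen order.
    public static Set<String> uniqueWords(String[] lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            // Split on any run of whitespace instead of a single space
            for (String token : line.trim().split("\\s+")) {
                // Assumption: ignore case and leading/trailing punctuation
                String word = token
                        .replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                        .toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word); // add() is a no-op for words already present
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] lines = { "The quick  brown fox,", "", "the lazy dog. The end" };
        // prints [the, quick, brown, fox, lazy, dog, end]
        System.out.println(uniqueWords(lines));
    }
}
```

LinkedHashSet is used only so the words come out in the order they first appear; a plain HashSet works the same way if order does not matter.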

Answer 2

If your goal is to remove duplicates, you can do that by adding the array to a java.util.Set. So all you need to do now is:

Set<String> noDuplicates = new HashSet<>( Arrays.asList( lines ) );

No more duplicates.
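
Note that the lines array in the question holds whole lines, so the one-liner above removes duplicate lines rather than duplicate words. If distinct words are wanted, one possible sketch is to flatten the lines into words before collecting them into a set. The class name DistinctWordsStream and the method distinctWords are hypothetical, and sorting via TreeSet is just one choice.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class DistinctWordsStream {

    // Hypothetical helper: flattens lines into whitespace-separated words
    // and collects them into a sorted set, which drops duplicates.
    public static Set<String> distinctWords(String[] lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty()) // blank lines yield an empty token
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String[] lines = { "to be or", "not to be" };
        // prints [be, not, or, to]
        System.out.println(distinctWords(lines));
    }
}
```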

Java: Extracting non-duplicate words from a PDF file
