Question

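Can anyone help me search for a word across multiple PDF files and get the word count?

I need to display the PDFs in descending order of the word's count in each document, and I have to do this in Java.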

Answer 1

You can use PDFBox to count the occurrences of a word in a PDF file:

public static int countWordInFile(String word, String filename, String fileEncoding) throws Exception {
    int count=0;
    PrintStream ps = null;
    PrintStream originalSystemOut = System.out;

    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ps = new PrintStream(baos);
        System.setOut(ps);

        // Extracting text from page
        ExtractText.main(new String[] {//
                //
                        "-encoding", fileEncoding, //
                        "-console", //
                        filename //
                //
                });

        String content = baos.toString(fileEncoding);

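        // Count non-overlapping, case-sensitive occurrences of the word.
        // This is a plain substring match, so "cat" also matches inside "category".
        if (!word.isEmpty()) {
            int idx = 0;
            while ((idx = content.indexOf(word, idx)) != -1) {
                count++;
                idx += word.length();
            }
        }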

    } finally {
        IOUtils.closeQuietly(ps);
        System.setOut(originalSystemOut);
    }

    return count;
}

Answer 2

Get the data:
Download iText (a PDF tool), open every PDF file you want to scan, read the text in each one, and build a HashMap that stores word -> count(word).
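
The counting step can be sketched in plain Java once the text has been pulled out of the PDF. This is only a rough sketch: it assumes the text was already extracted by whatever PDF library you use, and the names WordCounter and countWords are made up for illustration. It lower-cases the text and splits on anything that is not a letter or digit, which may or may not match your definition of a "word".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordCounter {

    // Builds a word -> count map from already-extracted text.
    // Tokens are lower-cased; any run of non-letter, non-digit characters separates words.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {           // split() can yield an empty first token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat saw the other cat.");
        System.out.println(counts.get("the")); // 2
        System.out.println(counts.get("cat")); // 2
    }
}
```

For the original question you would call this once per PDF and keep only the count for the word you are searching for.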

Sort your HashMap:
This has already been answered on Stack Overflow: Sort a Map<Key, Value> by values (Java)
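
For reference, one possible way to do the sort, assuming you end up with a map of file name -> count for the searched word. The class and method names (RankByCount, sortByCountDescending) are hypothetical, and ties are broken by file name only so that the output is stable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RankByCount {

    // Returns the map's entries ordered by count, highest first;
    // entries with equal counts are ordered by file name.
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a.pdf", 3);
        counts.put("b.pdf", 7);
        counts.put("c.pdf", 3);
        for (Map.Entry<String, Integer> e : sortByCountDescending(counts)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        // prints b.pdf: 7, then a.pdf: 3, then c.pdf: 3
    }
}
```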

Answer 3

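It sounds like you are looking for a starting point or some ideas rather than one specific solution, so here are a few options.

First, you need to make sure the text content of your PDFs is searchable. Using Adobe Acrobat, for example, is one way.

Second, you need some kind of API to index the PDF files so that they can be searched. Here is a section on the Apache Lucene website that may give you some hints.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.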
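
To make the idea of "indexing" a bit more concrete, here is a toy in-memory inverted index in plain Java. To be clear, this is not Lucene and does not use its API; it only illustrates the general idea of mapping each word to the documents that contain it. TinyIndex, addDocument and lookup are made-up names, and the text is assumed to have been extracted from the PDFs already.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TinyIndex {

    // word -> (document name -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Tokenizes the text (lower-cased, split on non-letter/non-digit runs)
    // and records every word under the given document name.
    void addDocument(String name, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(name, 1, Integer::sum);
        }
    }

    // Returns document name -> count for the word, or an empty map if it was never seen.
    Map<String, Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT),
                Collections.emptyMap());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("a.pdf", "Lucene indexes text. Text search is fast.");
        index.addDocument("b.pdf", "Plain text only.");
        System.out.println(index.lookup("text").get("a.pdf")); // 2
        System.out.println(index.lookup("text").get("b.pdf")); // 1
    }
}
```

With an index like this, each search is a single map lookup instead of a rescan of every file, which is the main reason to index at all.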

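Keep in mind that your question does not give much context, so indexing the PDFs, or Lucene itself, may be overkill for you.

I would suggest googling for some approaches: try "text search pdf files", "read pdf files java", and so on.

Here is another answer that may help you as well.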

Search for a word in multiple PDF files and index the PDFs by word count

3 answers: