Question

I wrote the following function, which just uses the PDFBox library to print out the text in a PDF:

private String readFirstNChars(int N) { // N has not been used
    PDFTextStripper pdfTextStripper = null;
    PDDocument pdDocument = null;
    COSDocument cosDocument = null;
    File currentFile = this.pdfFile;

    try {
        PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(currentFile));
        parser.parse();
        cosDocument = parser.getDocument();
        pdfTextStripper = new PDFTextStripper();
        pdDocument = new PDDocument(cosDocument);
        pdfTextStripper.setStartPage(1);
        pdfTextStripper.setEndPage(1);
        String parsedText = pdfTextStripper.getText(pdDocument);
        return parsedText;
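    } catch (IOException e) {
        e.printStackTrace();
        return null;
    } finally {
        // Close the document so the underlying file handle is released.
        if (pdDocument != null) {
            try {
                pdDocument.close();
            } catch (IOException closeError) {
                closeError.printStackTrace();
            }
        }
    }
}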

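I was thinking of printing the first N characters of parsedText, but then it occurred to me that if the file I read is very large, this approach makes no sense: it loads the entire text into memory and only then takes the first N characters. Is there a way to read only N characters from the PDF?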

Answer 1

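You may need the source code of PDFParser so that you can adapt the appropriate method or write your own. A PDF is more than just readable text, so basically you need to parse the document, discard everything that is not readable text, and keep a count of the actual text you have found so far.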
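
The "keep a count" idea can be sketched without touching the parser, by counting on the output side instead. This is only a sketch under assumptions, not PDFBox's own mechanism: `BoundedWriter` and `LimitReachedException` are hypothetical names made up for this example, and the block uses nothing but `java.io`.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical helper: keeps at most `limit` characters, then tells the producer to stop. */
class BoundedWriter extends Writer {

    /** Hypothetical signal thrown once the limit is reached, so the caller can abort early. */
    static class LimitReachedException extends IOException {
        LimitReachedException(int limit) {
            super("Character limit reached: " + limit);
        }
    }

    private final StringBuilder buffer = new StringBuilder();
    private final int limit;

    BoundedWriter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Keep only as many characters as still fit under the limit.
        int remaining = limit - buffer.length();
        buffer.append(cbuf, off, Math.min(len, remaining));
        if (buffer.length() >= limit) {
            throw new LimitReachedException(limit);
        }
    }

    @Override
    public void flush() {
        // Nothing to flush: everything is held in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        BoundedWriter out = new BoundedWriter(5);
        try {
            out.write("Hello, ");
            out.write("world"); // never reached: the first write hits the limit
        } catch (IOException e) {
            // Expected: the limit was reached, so the producer stops here.
        }
        System.out.println(out); // prints "Hello"
    }
}
```

With PDFBox, a writer like this could be passed to `pdfTextStripper.writeText(pdDocument, writer)` instead of calling `getText`, with a `catch` for `LimitReachedException` around the call. As far as I know the stripper emits text as it goes through the pages, so this caps how much extracted text is held in memory; the PDF structure itself still has to be parsed.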
