Question

I am parsing a PDF with PDFBox to extract all of its text:

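import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// PDFBox 1.x API; the class name is arbitrary.
public class ExtractText {
    public static void main(String[] args) {
        PDDocument pdDoc = null;
        File file = new File("C:\\Users\\admin\\Downloads\\Airtel.pdf");
        try {
            // Parse the file into a COSDocument and wrap it in a PDDocument.
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            // Extract the text of the first page only.
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(1);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the document even if extraction failed.
            if (pdDoc != null) {
                try { pdDoc.close(); } catch (IOException ignored) { }
            }
        }
    }
}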

But it does not print any text in the output. Please help.

Answer 1

PDFBox extracts text in much the same way that Adobe Reader does when you copy and paste text.

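If you open your document in Adobe Reader, press <Ctrl-A> to mark all text (you will see that hardly anything gets marked), and copy and paste the selection into an editor, you will find that Adobe Reader also extracts hardly anything.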

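The reason why neither PDFBox nor Adobe Reader (nor any other ordinary text extractor) extracts the text from your document is that it contains virtually no text! The "text" you see is not drawn with text drawing operations. Instead, the outline of each "character" is defined as a path, and the area inside that path is filled. So nothing tells a text extractor that there is any text at all.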
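To make the difference concrete, here is a minimal sketch that counts text-showing operators (`Tj`, `TJ`, `'`, `"`) and path-filling operators (`f`, `F`, `f*`, `B`, ...) in an already decoded page content stream. The class name `OperatorCounter` and the two sample streams are made up for illustration, and the tokenizer is deliberately simplified (it does not handle inline images or comments), so treat it as an illustration rather than a real parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: counts text-showing vs. path-filling operators.
class OperatorCounter {
    // Operators that show text.
    static final Set<String> TEXT_OPS =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    // Operators that fill (and possibly stroke) the current path.
    static final Set<String> FILL_OPS =
            new HashSet<>(Arrays.asList("f", "F", "f*", "B", "B*", "b", "b*"));

    // Returns {textOperatorCount, fillOperatorCount}.
    static int[] count(String content) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // nesting depth inside a literal string "( ... )"
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (depth > 0) {               // inside a literal string: skip it
                if (c == '\\') i++;        // skip the escaped character
                else if (c == '(') depth++;
                else if (c == ')') depth--;
                continue;
            }
            if (c == '(') {
                depth = 1;
                out.append(' ');
            } else if (c == '<') {         // skip a hex string "<...>"
                int end = content.indexOf('>', i);
                if (end < 0) break;
                i = end;
                out.append(' ');
            } else if (c == '[' || c == ']') {
                out.append(' ');           // array brackets act as delimiters
            } else {
                out.append(c);
            }
        }
        int text = 0, fill = 0;
        for (String token : out.toString().trim().split("\\s+")) {
            if (TEXT_OPS.contains(token)) text++;
            else if (FILL_OPS.contains(token)) fill++;
        }
        return new int[] { text, fill };
    }

    public static void main(String[] args) {
        // A page that really draws text.
        int[] a = count("BT /F1 12 Tf 72 700 Td (Hello) Tj ET");
        // A page that only fills outlines, like the document in question.
        int[] b = count("72 700 m 80 700 l 80 712 l h f 90 700 m 98 712 l 106 700 l h f");
        System.out.println("text=" + a[0] + " fill=" + a[1]); // text=1 fill=0
        System.out.println("text=" + b[0] + " fill=" + b[1]); // text=0 fill=2
    }
}
```

A page whose stream looks like the second sample gives a text extractor nothing to work with, no matter which library reads it.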

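There actually are two characters of real text in the document: the "-" sign between the "Previous balance" and "Payments" boxes, and the "-" sign between the "Payments" and "Adjustments" boxes. Even these two characters are not extracted as desired, because the font provides no information about which Unicode code points they represent.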
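The role of that missing mapping can be sketched as follows. This is a toy model, not PDFBox code: the class `ToUnicodeSketch`, the glyph code `1`, and the map contents are assumptions for illustration, and real extractors try further fallbacks (font encoding, glyph names) before giving up.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of mapping glyph codes to Unicode during text extraction.
class ToUnicodeSketch {
    // Translates glyph codes to text; codes without a mapping become U+FFFD.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String s = (toUnicode == null) ? null : toUnicode.get(code);
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] shown = { 1 }; // hypothetical glyph code used for the "-" sign

        // A font that carries a code-to-Unicode mapping.
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "-");
        System.out.println(decode(shown, mapping)); // prints "-"

        // A font without any mapping: the glyph is drawn, but its meaning is unknown.
        System.out.println(decode(shown, null).equals("\uFFFD")); // prints "true"
    }
}
```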

So applying OCR to the document is virtually your only chance of extracting its text content.
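One possible way to do that, sketched under several assumptions: the page has already been rendered to an image file (with PDFBox 1.x, for example, via `PDPage.convertToImage()`), the Tesseract command-line tool is installed and on the `PATH`, and `page1.png` is a hypothetical file name. The class `OcrSketch` is made up for this example; other OCR engines would work just as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the Tesseract command-line tool.
class OcrSketch {
    // "stdout" as the output base makes tesseract print the recognized text.
    static List<String> buildCommand(String imagePath, String language) {
        return Arrays.asList("tesseract", imagePath, "stdout", "-l", language);
    }

    // Runs tesseract on the image and returns the recognized text.
    static String runOcr(String imagePath, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, language))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OcrSketch page1.png
        if (args.length == 0) {
            System.out.println("Would run: " + buildCommand("page1.png", "eng"));
            return;
        }
        System.out.println(runOcr(args[0], "eng"));
    }
}
```

OCR quality depends heavily on the rendering resolution, so rendering the page at a higher DPI before recognition is usually worth trying.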

PDFBox is not giving the right output
