Question

I am getting an error when using the TextExtractor of the PDF Clown library. The code I am using is:

TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

  // Extract the page text!
  Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
}

Part of the error I get is:

Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.document.contents.fonts.encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>

I also found that this happens when my PDF contains bullet points, such as:

Item 1
Item 2
Item 3

Please help me extract the text from a PDF like this.

Answer 1

(The following comment turned out to be the solution:)

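Using your highlighter.java class (provided in a comment, on your Google Drive) with the current PDF Clown trunk version as a jar, the PDF was processed without any incident, in particular without a NullPointerException. The highlights were added, though some of them were not at the right position.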

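After looking at the contents of your shared Google Drive, I assume that you are not using the PDF Clown jar, but have merely compiled the classes from the source folder of the distribution and are using those.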

The PDF Clown jar file contains additional resources, though, which your setup does not include. Thus:

Your highlighter.java has to be used with pdfclown.jar on the classpath.
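
To check which setup is actually in effect, a small diagnostic along the following lines may help. This is only a sketch, not part of PDF Clown: the class name ClasspathCheck, its two helper methods, and the placeholder resource path are made up for illustration. It reports where a class was loaded from (a jar or a plain classes directory) and whether a given resource can be found next to it. The connection to the error above is an inference from the stack trace: Class.getResourceAsStream returns null for a missing resource, and passing null to new InputStreamReader(...) fails with a NullPointerException in the java.io.Reader constructor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.CodeSource;

public class ClasspathCheck {

    /** Returns the location (jar or classes directory) a class was loaded from, or "unknown". */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "unknown"; // e.g. classes loaded by the bootstrap loader
        }
        return source.getLocation().toString();
    }

    /** Returns true if the resource can be opened through the given class. */
    static boolean hasResource(Class<?> type, String resourcePath) {
        // getResourceAsStream returns null (it does not throw) when nothing is found.
        try (InputStream in = type.getResourceAsStream(resourcePath)) {
            return in != null;
        } catch (IOException e) {
            return false; // closing the stream failed; treat as unusable
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: fully qualified class name to inspect (defaults to this class).
        // args[1]: resource path to look up (the default is a hypothetical placeholder).
        Class<?> type = args.length > 0 ? Class.forName(args[0]) : ClasspathCheck.class;
        String resourcePath = args.length > 1 ? args[1] : "/hypothetical/missing.resource";

        System.out.println("Class loaded from: " + originOf(type));
        System.out.println("Resource " + resourcePath + " found: " + hasResource(type, resourcePath));
    }
}
```

Run with pdfclown.jar on the classpath and the TextExtractor class name as the first argument (org.pdfclown.tools.TextExtractor, assuming the usual package), the first line should point at the jar file. If it points at a directory of compiled classes instead, the resources bundled in the jar are most likely missing from that setup.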
