Extracting word coordinates with PDFBox

Time: 2018-07-08 03:54:46

Tags: coordinates extract pdfbox word

Hi, this question refers to an earlier post:

Could someone give me an example of how to extract coordinates for a 'word' using PDFBox

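I am using PDFBox 2.0.10.

I compiled the combined code successfully, but I get an exception when I try to run the example.

The solution given there has no standard main method, which confused me.

Could someone advise me how to run the combined code successfully?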

package org.apache.pdfbox.examples.text;

import java.io.IOException;

// Small launcher: ExtractWordCoordinates lives in the same package,
// so it needs no import of its own.
public class ExtractWordCoordinates2 {
    public static void main(String[] args) throws IOException {
        ExtractWordCoordinates ewc = new ExtractWordCoordinates();
        // Calls the method from the linked answer directly instead of running it as a test.
        ewc.testExtractWordsForGoodJuJu();
    }
}

Stack trace

Jul 08, 2018 4:15:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO: To get higher rendering speed on java 8 oder 9,
Jul 08, 2018 4:15:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Jul 08, 2018 4:15:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB suggestKCMS
INFO:   or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
Exception in thread "main" java.lang.NullPointerException
        at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:422)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1142)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
        at org.apache.pdfbox.examples.text.ExtractWordCoordinates.testExtractWordsForGoodJuJu(ExtractWordCoordinates.java:47)
        at org.apache.pdfbox.examples.text.ExtractWordCoordinates2.main(ExtractWordCoordinates2.java:17)

ExtractWordCoordinates can be found here: https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/ExtractWordCoordinates.java#L69

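1 answer:

Answer 0 (score: 0)

Problem solved.

The getResourceAsStream call in the following line of ExtractWordCoordinates returned null, and that null stream was then passed to PDDocument.load: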

    try (   InputStream resource = getClass().getResourceAsStream("apache.pdf")) {

After I copied the document (apache.pdf) into the same directory as ExtractWordCoordinates.class, the code ran successfully.
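
To make this kind of failure easier to spot, the resource lookup can be checked before the stream is handed to PDFBox. The following is only a minimal sketch using the standard library: the class name ResourceCheck and the helper openResource are hypothetical and not part of PDFBox or of the linked example. It relies on the fact that Class.getResourceAsStream returns null rather than throwing when the resource is not found.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public class ResourceCheck {

    /**
     * Hypothetical helper: looks up a resource relative to the given class
     * and fails with a descriptive exception instead of returning null.
     */
    static InputStream openResource(Class<?> anchor, String name) throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            // getResourceAsStream signals "not found" with null, not with an exception.
            throw new FileNotFoundException(
                    "Resource '" + name + "' not found on the classpath relative to " + anchor.getName());
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // "apache.pdf" is the file name from the question; it must sit in the
        // same classpath location (package directory) as this class.
        try (InputStream resource = openResource(ResourceCheck.class, "apache.pdf")) {
            System.out.println("Found apache.pdf, bytes available: " + resource.available());
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a missing apache.pdf is reported by name up front instead of surfacing later as the NullPointerException in ScratchFile.createBuffer shown in the stack trace.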