Question

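I tried PDFBox on regular .pdf files and it works fine.

However, when it hits a corrupted .pdf, the code "freezes". No error or anything else is thrown; the load or parse call simply takes forever to execute.

Here is the corrupted file (I zipped it so that everyone can download it). It may not be a native PDF file, but it was saved with a .pdf extension and is only 4 KB.

I am not an expert at all, but I think this is a bug in PDFBox. According to the documentation, both the load() and parse() methods should throw an exception on failure. With my file, however, the code runs forever and never throws.

I have tried using only load, and I have tried parse() as well. The result is the same.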

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class TestTest {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        System.out.println(pdfToText("C:\\..............MYFILE.pdf"));
        System.out.println("done ! ! !");
    }

    private static String pdfToText(String fileName) throws IOException {
        PDDocument document = PDDocument.load(new File(fileName)); // THIS TAKES FOREVER
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            // Extract the text before the document is closed
            return stripper.getText(document);
        } finally {
            document.close();
        }
    }
}

If the .pdf file is corrupted, how can I force this code to throw an exception or stop executing? Thanks.

Answer 1

Try this solution:

private static String pdfToText(String fileName) {
    PDDocument document = null;
    try {
        document = PDDocument.load(fileName);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(document);
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        return null;
    } finally {
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Answer 2

To get a simple timeout around a third-party library, I often use an implementation like Apache Commons IO's ThreadMonitor:

long timeoutInMillis = 1000;

try {
    Thread monitor = ThreadMonitor.start(timeoutInMillis);  
    // do some work here
    ThreadMonitor.stop(monitor);
} catch (InterruptedException e) {
    // timed amount was reached
}

The example code is taken from Apache's ThreadMonitor Javadoc. I only use it when the third-party API does not provide a timeout mechanism of its own.
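If pulling in Commons IO is not an option, a similar timeout can be sketched with the JDK's own `ExecutorService` and `Future.get(timeout, unit)`. This is a different technique from ThreadMonitor, not a replacement for it, and the names `TimedCall` and `callWithTimeout` are made up for this illustration. It also shares the same limitation: cancelling only interrupts the worker thread, so a task that never checks for interruption keeps running in the background after the caller has given up on it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox or Commons IO.
final class TimedCall {
    private TimedCall() {}

    /** Runs the task on a worker thread and gives up after timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops tasks that honor interrupts.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The slow call (for example the PDF loading in the question) would go inside the `Callable`, and the caller would treat a `TimeoutException` as "this file could not be processed in time".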

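A few weeks ago, though, I had to adapt the ThreadMonitor approach slightly, because it does not work with (third-party) code that masks exceptions.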

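In particular, we ran into trouble with c3p0, which masks all exceptions (InterruptedException included). Our solution was to change the implementation so that it checks the cause chain of the caught exception for an InterruptedException.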
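The cause-chain check could look roughly like the sketch below. It is not our actual code and has nothing c3p0-specific in it; `InterruptCheck` and `causedByInterrupt` are names invented for this example, and the depth limit of 64 is an arbitrary guard against cyclic cause chains.

```java
// Hypothetical helper for illustration only.
final class InterruptCheck {
    private InterruptCheck() {}

    /** Returns true if t, or any exception in its cause chain, is an InterruptedException. */
    static boolean causedByInterrupt(Throwable t) {
        int depth = 0;
        // Walk getCause() links; stop at the end of the chain or at the depth limit.
        for (Throwable cur = t; cur != null && depth < 64; cur = cur.getCause(), depth++) {
            if (cur instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }
}
```

In the `catch` block around the third-party call, a `true` result would then be handled the same way as a directly thrown `InterruptedException`, i.e. as "the timeout was reached".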

Possible bug in the load() and parse() methods in PDFBox?
