Question

I am trying to split a 300-page document using the Apache PDFBox API v2.0.2. When I try to split the PDF file into single pages with the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I get the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

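This indicates that the GC is spending a lot of time while reclaiming an unreasonably small amount of heap.

There are many JVM tuning options that could work around this, but all of them only treat the symptom, not the real problem.

One last note: I am using JDK 6, so the new Java 8 Consumer is not an option for me. Thanks.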

Edit

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2:

 1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270 pages 13.8MB PDF file and after slicing
    the size of each slice is an average of 80KB with total size of
    30.7MB.
 2. The Split throws the exception even before it returns the splitted parts.

I found that the split succeeds as long as I do not pass the whole file at once, but instead pass it in "batches" of 20-30 pages each.

Answer 1

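PDFBox keeps the parts produced by the split operation on the heap as PDDocument objects, so the heap fills up quickly. Even if you call close() after every iteration of the loop, the GC cannot reclaim heap as fast as it is being filled.

One option is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):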

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and inclusive at both ends,
        // so each batch covers start..end without overlapping the next one.
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is my own settings class (not shown); any output directory works here
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // start + index is the 1-based page number in the source document
        File pdfFile = new File(outputPath, title + "-" + (start + index) + ".pdf");
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            splittedDocument.save(pdfFile);
        } finally {
            // release each part as soon as it has been written
            splittedDocument.close();
        }
    }
}
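
As a sanity check on the batch arithmetic, here is a minimal sketch with no PDFBox dependency that computes the page ranges to pass to setStartPage and setEndPage. The class name PageBatches and the method ranges are hypothetical names used only for illustration. The sketch assumes both bounds are 1-based and inclusive, which is how Splitter treats them in PDFBox 2.0, so consecutive batches must not share a page. It sticks to JDK 6 syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes batch boundaries only.
final class PageBatches {

    /**
     * Returns inclusive, 1-based {start, end} pairs that together cover
     * pages 1..totalPages in chunks of at most batchSize pages.
     */
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // The last batch is cut short at the final page.
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 50, matching the numbers in the question
        for (int[] r : ranges(270, 50)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

For a 270-page document and a batch size of 50 this gives six ranges, the last being 251-270. When the page count is an exact multiple of the batch size, no empty trailing batch is produced.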

Apache PDFBox - Getting java.lang.OutOfMemoryError when using split(PDDocument document)