Splitting a PDF document into multiple documents

Time: 2016-05-25 21:33:14

Tags: pdfbox

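I am trying to split a PDF document into multiple documents, where each document contains the largest number of pages whose file size stays below a maximum file size.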
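The goal stated above can be sketched independently of PDFBox as a binary search for the longest run of leading pages that still fits. This is only a sketch under assumptions: `MaxPrefixSearch`, `largestFittingPrefix` and `sizeOfFirst` are hypothetical names, the size function stands in for "save the first n pages and measure the bytes", and the saved size is assumed never to shrink when a page is added.

```java
import java.util.function.IntToLongFunction;

public class MaxPrefixSearch {

    /**
     * Returns the largest n in [1, pageCount] with sizeOfFirst(n) <= maxSize.
     * Returns 1 if even a single page is too large, because a page cannot be split further.
     * Assumes pageCount >= 1 and that sizeOfFirst never decreases as n grows.
     */
    public static int largestFittingPrefix(int pageCount, long maxSize, IntToLongFunction sizeOfFirst) {
        int low = 1;           // a single page is always accepted
        int high = pageCount;  // upper bound: every remaining page
        while (low < high) {
            int mid = low + (high - low + 1) / 2; // round up so the loop always makes progress
            if (sizeOfFirst.applyAsLong(mid) <= maxSize) {
                low = mid;       // the first mid pages fit, try more
            } else {
                high = mid - 1;  // too big, try fewer
            }
        }
        return low;
    }

    public static void main(String[] args) {
        // Hypothetical per-page sizes in bytes, standing in for real saved-PDF sizes.
        long[] pageSizes = {400, 300, 500, 200, 900};
        IntToLongFunction sizeOfFirst = n -> {
            long total = 0;
            for (int i = 0; i < n; i++) {
                total += pageSizes[i];
            }
            return total;
        };
        System.out.println(largestFittingPrefix(pageSizes.length, 1000, sizeOfFirst)); // prints 2
    }
}
```

Real PDF sizes are not a simple sum of page sizes, since shared fonts and images are stored once, so the size function would need to save the candidate pages and measure the result.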

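My code currently works when run from Eclipse, but when I double-click the .jar file, the static method in the Java class seems to crash (and I can't seem to catch the exception).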

The code that does not work is:

myListOfDocuments = mysplitter.split(document);

When the line above is called, the JVM somehow bails out of the static method. Loading appears to work fine, for example: PDDocument document = PDDocument.load(aFile);

Any ideas?

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class PDFMaxSizeSplitter {


    public static void main(String[] args) {
    }

    public static ArrayList<File> splitTheFile(File aFile,long maxSize){

        ArrayList<File> resultFiles = new ArrayList<File>();

        //Checks to see if file is already small enough
        if (aFile.length() <= maxSize){
            resultFiles.add(aFile);
            return resultFiles;
        }

        //checks to see if it's a directory
        if (aFile.isDirectory()){
            resultFiles.add(aFile);
            return resultFiles;
        }

        try {

            PDDocument document = PDDocument.load(aFile);
            Splitter mysplitter = new Splitter();
            List<PDDocument> myListOfDocuments = mysplitter.split(document);
            int docNumber = 0;
            while (myListOfDocuments.size()>0){
                long theResults = 0;
                theResults = getChunk(myListOfDocuments,0,(long) (myListOfDocuments.size()-1),maxSize);
                PDDocument newPDFDoc = new PDDocument();
                for (long pageindex=0; pageindex<=theResults; pageindex++){
                    newPDFDoc.addPage(myListOfDocuments.get((int) pageindex).getPage(0)); 
                }
                File newFile = new File(aFile.getParentFile() +
                                        File.separator +
                                        aFile.getName().replace(".pdf", "") +
                                        "Part" +
                                        String.format("%03d", docNumber) +
                                        ".pdf");
                //System.out.println(newFile.getCanonicalFile());
                newPDFDoc.save(newFile);
                resultFiles.add(newFile);
                myListOfDocuments=myListOfDocuments.subList((int) (theResults)+1, (myListOfDocuments.size()));
                newPDFDoc.close();
                docNumber++;
            }

            document.close();


        } catch (IOException e) {
            e.printStackTrace();
        }
        return resultFiles;
    }

    private static long getChunk(List<PDDocument> thePages, long lowPage, long highPage, long maxSize) throws IOException{
        //System.out.println("low " + lowPage + " high page: " + highPage);
        if ( (highPage-lowPage)<=1 ){
            if(PDFMaxSizeSplitter.testSize(thePages,0,highPage)<=maxSize){
                return highPage;
            } else{
                return lowPage;
            }

        } else if (PDFMaxSizeSplitter.testSize(thePages, 0, lowPage + (highPage-lowPage)/2) <= maxSize) {
            return PDFMaxSizeSplitter.getChunk(thePages, lowPage + (highPage-lowPage)/2, highPage, maxSize);
        } else {
            return PDFMaxSizeSplitter.getChunk(thePages, lowPage, lowPage + (highPage-lowPage)/2, maxSize);
        }
    }

    private static long testSize(List<PDDocument> thePages, long start, long stop) throws IOException{
        //System.out.println("Trying: " + (new Long(start)).toString() + " to " + (new Long(stop)).toString()); 
        PDDocument testerdocument = new PDDocument();
        //Path tempPath = Files.createTempFile((new Long(start)).toString(), (new Long(stop)).toString());
        //System.out.println("Creating tempPath " +tempPath.toString());    
        //File tempFile=new File(tempPath.toString());
        ByteArrayOutputStream tempFile = new ByteArrayOutputStream();
        for (long pageindex=start; pageindex<=stop; pageindex++){
            testerdocument.addPage(thePages.get((int) pageindex).getPage(0)); 
        }
        testerdocument.save(tempFile);
        long thefilesize = tempFile.size();
        //long thefilesize =  (tempFile.length());
        //Files.deleteIfExists(tempPath);
        tempFile.reset();
        testerdocument.close();
        return thefilesize;
    }
}

------------------------- Edit

It turned out that the JVM was running out of memory.
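This would also explain why the exception could not be caught: `OutOfMemoryError` is a `java.lang.Error`, not an `IOException`, so a `catch (IOException e)` block never sees it. A minimal sketch of the difference follows; `ErrorVersusException`, `simulatedSplit` and `runAndReport` are hypothetical names, and the error is thrown by hand rather than by really exhausting the heap.

```java
import java.io.IOException;

public class ErrorVersusException {

    /** Stand-in for the failing call; throws a simulated OutOfMemoryError. */
    static void simulatedSplit() throws IOException {
        throw new OutOfMemoryError("simulated: Java heap space");
    }

    /** Returns a description of what was caught, to show which handler actually runs. */
    public static String runAndReport() {
        try {
            simulatedSplit();
            return "completed";
        } catch (IOException e) {
            // Not reached here: an Error is not an IOException.
            return "IOException: " + e.getMessage();
        } catch (Throwable t) {
            // Reached for Errors such as OutOfMemoryError.
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndReport()); // prints "OutOfMemoryError: simulated: Java heap space"
    }
}
```

A double-clicked jar usually has no visible console, so starting it with `java -jar` from a terminal is one way to see such an error.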

1 Answer:

Answer 0: (score: 0)

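It turned out that the JVM was running out of memory. I added a JVM argument to increase the memory. I also switched to 64-bit JVM mode by passing the -d64 argument to the JVM. In addition, I have been using the disk-backed (temp file) memory management in PDFBox, e.g. PDDocument.load(aFile, MemoryUsageSetting.setupTempFileOnly());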
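Since the behaviour differed between Eclipse and the double-clicked jar, it may help to print the heap limit the JVM actually received. The following is a small sketch using only the standard library; `HeapReport` and `maxHeapMegabytes` are hypothetical names, and the answer does not say which memory argument was used, so `-Xmx` below is an assumption.

```java
public class HeapReport {

    /** Maximum heap the JVM will try to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
        // os.arch gives a hint whether a 32-bit or 64-bit JVM is running.
        System.out.println("Architecture: " + System.getProperty("os.arch"));
    }
}
```

For example, `java -Xmx2g -jar PDFMaxSizeSplitter.jar` (jar name and heap size are illustrative) should report a limit close to 2048 MB, while a double-clicked jar runs with whatever default the installed JVM picks.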

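With these settings I can process files several gigabytes in size. In the code I now try to load the document directly into memory and catch the out-of-memory error so that I can switch to a low-memory mode. In low-memory mode I use MemoryUsageSetting.setupTempFileOnly() to avoid using too much heap.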
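The fallback described above might look roughly like the sketch below. It is deliberately independent of PDFBox so that it runs on its own: `LowMemoryFallback` and `loadWithFallback` are hypothetical names, and the out-of-memory condition is simulated. With PDFBox 2.x, the first attempt would presumably be `PDDocument.load(aFile)` and the second `PDDocument.load(aFile, MemoryUsageSetting.setupTempFileOnly())`.

```java
import java.util.concurrent.Callable;

public class LowMemoryFallback {

    /**
     * Runs the in-memory attempt first. If it fails with OutOfMemoryError,
     * runs the low-memory attempt instead and returns its result.
     */
    public static <T> T loadWithFallback(Callable<T> inMemory, Callable<T> lowMemory) throws Exception {
        try {
            return inMemory.call();
        } catch (OutOfMemoryError e) {
            // Heap exhausted: nothing from the first attempt is kept, so it can be garbage collected.
            return lowMemory.call();
        }
    }

    public static void main(String[] args) throws Exception {
        String mode = LowMemoryFallback.<String>loadWithFallback(
                () -> { throw new OutOfMemoryError("simulated"); }, // stand-in for the in-memory load
                () -> "temp-file mode");                            // stand-in for the temp-file load
        System.out.println(mode); // prints "temp-file mode"
    }
}
```

Catching `OutOfMemoryError` is generally a last resort, since the JVM can be left in a fragile state; it is more defensible here because the failed allocation is large and isolated to a single call.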