Fastest way to count PDF images using PDFBox 2.x

Date: 2016-07-19 17:55:48

Tags: java pdf pdfbox

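We occasionally encounter very large PDFs that are filled with full-page, high-resolution images (the result of document scanning). For example, I have a 1.7 GB PDF containing more than 3,500 images. Loading the document takes about 50 seconds, but counting the images takes about 15 minutes.

I am fairly sure this is because the image bytes are read as part of the API call. Is there a way to get the image count without actually reading the image bytes?

PDFBox version: 2.0.2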

If I change the countImages method (shown in the sample code below) to rely on the COSName instead, the count completes in less than 1 second, but I am a little uneasy about depending on a name prefix. The prefix seems to be a by-product of the PDF encoder rather than of PDFBox (I could not find any reference to it in the PDFBox code).

Sample code:

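// Imports used by the snippets in this question. The StopWatch package is an
// assumption (Apache Commons Lang 3); the original post does not show it.
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.lang3.time.StopWatch;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

@Test
public void imageCountIsCorrect() throws Exception {
    // PDDocument is Closeable, so try-with-resources replaces the manual finally block.
    try (PDDocument pdf = readPdf()) {
        assertEquals(3558, countImages(pdf));
        // assertEquals(3558, countImagesWithExtractor(pdf));
    }
}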

protected PDDocument readPdf() throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    PDDocument pdf;
    // Close the input stream once loading finishes; the returned document stays open for the caller.
    try (FileInputStream stream = new FileInputStream("large.pdf")) {
        // Mixed mode: buffer up to 250 MB in main memory, then spill to temporary files.
        pdf = PDDocument.load(stream, MemoryUsageSetting.setupMixed(1024 * 1024 * 250));
    }

    stopWatch.stop();
    log.info("PDF loaded: time={}s", stopWatch.getTime() / 1000);
    return pdf;
}


protected int countImages(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    int imageCount = 0;
    for (PDPage pdPage : pdf.getPages()) {
        PDResources pdResources = pdPage.getResources();
        if (pdResources == null) {
            continue; // page has no resource dictionary, so it has no XObjects
        }
        for (COSName cosName : pdResources.getXObjectNames()) {
            // getXObject() instantiates the XObject; per the question, this appears to be the slow step.
            PDXObject xobject = pdResources.getXObject(cosName);
            if (xobject instanceof PDImageXObject) {
                imageCount++;
                if (imageCount % 100 == 0) {
                    log.info("Found image: #" + imageCount);
                }
            }
        }
    }

    stopWatch.stop();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}

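1 answer:

Answer 0: (score: 0)

So, the previous approach had some additional flaws (it could miss inline images, among other things). Thanks to mkl and Tilman Hausherr for the feedback!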

TIL - PDF content streams contain useful operator codes!

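My new approach extends PDFStreamEngine and increments the image count for each "Do" (draw object) operator found in the PDF content streams whose operand names an image XObject. With this approach, counting the images takes only a few hundred milliseconds: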

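import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;

public class PdfImageCounter extends PDFStreamEngine {
    // Running total across every page passed to processPage().
    protected int documentImageCount = 0;

    public int getDocumentImageCount() {
        return documentImageCount;
    }

    public PdfImageCounter() {
        // Register a handler for the "Do" operator only; other operators are ignored.
        addOperator(new OperatorProcessor() {
            @Override
            public void process(Operator operator, List<COSBase> arguments) throws IOException {
                if (arguments.size() < 1) {
                    throw new MissingOperandException(operator, arguments);
                }
                if (isImage(arguments.get(0))) {
                    documentImageCount++;
                }
            }

            // True when the operand is a name that refers to an image XObject in the current resources.
            protected boolean isImage(COSBase base) {
                return (base instanceof COSName) &&
                        context.getResources().isImageXObject((COSName) base);
            }

            @Override
            public String getName() {
                return "Do";
            }
        });
    }
}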

Call it for each page:

protected int countImagesWithProcessor(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    PdfImageCounter counter = new PdfImageCounter();
    for (PDPage pdPage : pdf.getPages()) {
        counter.processPage(pdPage);
    }

    stopWatch.stop();
    int imageCount = counter.getDocumentImageCount();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}
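
To make the idea behind the counter more concrete, here is a minimal, library-free sketch of what counting "Do" operators amounts to. It is not how PDFBox parses content streams: the class name DoOperatorSketch, the method countImageDraws, and the sample content are all hypothetical, and the whitespace-only tokenizer is an assumption that ignores string literals, comments, and inline image data.

```java
import java.util.Collections;
import java.util.Set;

class DoOperatorSketch {

    /**
     * Counts "Do" operators whose operand is one of the given image XObject names.
     * Toy tokenizer: splits on whitespace only, unlike a real PDF content stream parser.
     */
    public static int countImageDraws(String contentStream, Set<String> imageNames) {
        String[] tokens = contentStream.trim().split("\\s+");
        int count = 0;
        for (int i = 1; i < tokens.length; i++) {
            // PDF uses postfix notation, so the XObject name sits just before "Do".
            if (tokens[i].equals("Do")
                    && tokens[i - 1].startsWith("/")
                    && imageNames.contains(tokens[i - 1].substring(1))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical page content: Im1 is an image drawn twice, Fm1 is a form XObject.
        String content = "q 200 0 0 300 0 0 cm /Im1 Do Q\n"
                + "q 1 0 0 1 50 50 cm /Fm1 Do Q\n"
                + "q 100 0 0 150 0 300 cm /Im1 Do Q";
        System.out.println(countImageDraws(content, Collections.singleton("Im1")));
    }
}
```

Note that this kind of count reflects how many times an image is drawn, not how many distinct image objects exist: an image referenced by two "Do" operators is counted twice, which is also how documentImageCount behaves in the class above.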