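We occasionally run into very large PDFs made up of full-page, high-resolution images (the output of document scanning). For example, I have a 1.7 GB PDF with 3500+ images. Loading the document takes about 50 seconds, but counting the images takes about 15 minutes.
I'm sure this is because the image bytes are read as part of the API calls. Is there a way to get the image count without actually reading the image bytes?
PDFBox version: 2.0.2
Sample code: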
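If I change the countImages method to rely on the COSName instead, the count finishes in under 1 second, but I am a little uneasy about relying on a name prefix. The prefix appears to be a by-product of the PDF encoder rather than of PDFBox (I could not find any reference to it in the PDFBox code).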
    @Test
    public void imageCountIsCorrect() throws Exception {
        PDDocument pdf = readPdf();
        try {
            assertEquals(3558, countImages(pdf));
            // assertEquals(3558, countImagesWithExtractor(pdf));
        } finally {
            if (pdf != null) {
                pdf.close();
            }
        }
    }

    protected PDDocument readPdf() throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        FileInputStream stream = new FileInputStream("large.pdf");
        PDDocument pdf;
        try {
            pdf = PDDocument.load(stream, MemoryUsageSetting.setupMixed(1024 * 1024 * 250));
        } finally {
            stream.close();
        }
        stopWatch.stop();
        log.info("PDF loaded: time={}s", stopWatch.getTime() / 1000);
        return pdf;
    }

    protected int countImages(PDDocument pdf) throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        int imageCount = 0;
        for (PDPage pdPage : pdf.getPages()) {
            PDResources pdResources = pdPage.getResources();
            if (pdResources == null) {
                // A page without a resource dictionary has no XObjects to count.
                continue;
            }
            for (COSName cosName : pdResources.getXObjectNames()) {
                // This call materializes the XObject, which is the slow step in question.
                PDXObject xobject = pdResources.getXObject(cosName);
                if (xobject instanceof PDImageXObject) {
                    imageCount++;
                    if (imageCount % 100 == 0) {
                        log.info("Found image: #{}", imageCount);
                    }
                }
            }
        }
        stopWatch.stop();
        log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
        return imageCount;
    }
Answer 0 (score: 0)
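So the previous approach had some additional flaws (it could miss inline images, and so on). Thanks to mkl and Tilman Hausherr for the feedback!
TIL: PDF content streams contain useful operator codes!
My new approach extends PDFStreamEngine and increments an image counter for every "Do" (draw object) operator in the page content streams whose operand names an image XObject. With this approach, counting the images takes only a few hundred milliseconds: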
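    import java.io.IOException;
    import java.util.List;

    import org.apache.pdfbox.contentstream.PDFStreamEngine;
    import org.apache.pdfbox.contentstream.operator.MissingOperandException;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSName;

    public class PdfImageCounter extends PDFStreamEngine {
        protected int documentImageCount = 0;

        public int getDocumentImageCount() {
            return documentImageCount;
        }

        public PdfImageCounter() {
            // Handle only the "Do" operator; every other operator is skipped.
            addOperator(new OperatorProcessor() {
                @Override
                public void process(Operator operator, List<COSBase> arguments) throws IOException {
                    if (arguments.size() < 1) {
                        throw new MissingOperandException(operator, arguments);
                    }
                    if (isImage(arguments.get(0))) {
                        documentImageCount++;
                    }
                }

                // Checks the resource entry by name instead of loading the XObject.
                protected boolean isImage(COSBase base) {
                    return (base instanceof COSName) &&
                            context.getResources().isImageXObject((COSName) base);
                }

                @Override
                public String getName() {
                    return "Do";
                }
            });
        }
    }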
Call it for each page:
    protected int countImagesWithProcessor(PDDocument pdf) throws IOException {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        PdfImageCounter counter = new PdfImageCounter();
        for (PDPage pdPage : pdf.getPages()) {
            counter.processPage(pdPage);
        }
        stopWatch.stop();
        int imageCount = counter.getDocumentImageCount();
        log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
        return imageCount;
    }
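
To make the idea concrete outside of PDFBox, here is a toy sketch of counting "Do" operators. It is not how PDFBox parses content streams: it assumes whitespace-separated tokens and ignores strings, comments, inline images and nested form XObjects. The class name `DoOperatorCountSketch`, the method `countImageDraws` and the sample stream are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Toy illustration only (hypothetical names, no PDFBox involved):
 * counts "Do" operators whose operand is a name listed as an image.
 */
final class DoOperatorCountSketch {

    /**
     * @param contentStream whitespace-separated content stream tokens (a simplifying assumption)
     * @param imageNames    resource names, without the leading slash, that refer to image XObjects
     */
    static int countImageDraws(String contentStream, Set<String> imageNames) {
        int count = 0;
        String previous = null; // token before the current one, i.e. the operand of "Do"
        for (String token : contentStream.trim().split("\\s+")) {
            boolean isDo = token.equals("Do");
            if (isDo && previous != null && previous.startsWith("/")
                    && imageNames.contains(previous.substring(1))) {
                count++;
            }
            previous = token;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two draws of the image /Im0 and one draw of a non-image XObject /Fm1.
        String stream = "q 612 0 0 792 0 0 cm /Im0 Do Q q /Fm1 Do Q q /Im0 Do Q";
        Set<String> images = new HashSet<>(Arrays.asList("Im0"));
        System.out.println(countImageDraws(stream, images)); // prints 2
    }
}
```

Only operator tokens and resource names are inspected, so no image data is ever decoded. Like the PdfImageCounter above, the sketch counts draw operations, so an image that is drawn twice is counted twice.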