Question

I have been setting up a PDF-to-PNG conversion and cropping script using Python 3.6.3 and the wand library.

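I tried Pillow, but it lacks the conversion part. I am experimenting with extracting the alpha channel because I want to feed the images to an OCR later, so I turned to the code provided in this SO answer.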

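Two issues came up. The first is that if the file is large, I get a "Killed" message from the terminal. The second is that it seems rather picky about files: PDFs that convert properly on the command line with imagemagick's convert or with pdftoppm raise errors with wand.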

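I am mostly concerned with the first one, though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured: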

from wand.image import Image
from wand.color import Color


def convert_pdf(filename, path, resolution=300):
    all_pages = Image(filename=path+filename, resolution=resolution)
    for i, page in enumerate(all_pages.sequence):
        with Image(page) as img:
            img.format = 'png'
            img.background_color = Color('white')
            img.alpha_channel = 'remove'

            image_filename = '{}.png'.format(i)
            img.save(filename=path+image_filename)

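I noticed that the script writes all the files at the end of the process rather than one by one, which I guess could put an unnecessary burden on memory and ultimately cause a SEGFAULT or something similar.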

Thanks for checking out my question, and for any hints.

Answer 1

Yes, your line:

all_pages = Image(filename=path+filename, resolution=resolution)

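will start a GhostScript process to render the whole PDF to a huge temporary PNM file in /tmp. Wand will then load that massive file into memory and hand out pages from it as you loop.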

The C API to MagickCore lets you specify which page to load, so you could perhaps render one page at a time, but I don't know how to get the Python wand interface to do that.
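One thing that might be worth trying: ImageMagick filenames accept a zero-based page index in square brackets (`file.pdf[0]`), and wand passes the filename through to ImageMagick. The sketch below relies on that and opens one page per `Image`, so only a single page should be held in memory at a time. `page_spec` and `convert_pdf_page_by_page` are hypothetical names, and `n_pages` is assumed to be known already (for example from `pdfinfo`). I have not measured this against the PDF in the question.

```python
import os


def page_spec(pdf_path, page_index):
    """Build an ImageMagick-style filename selecting one zero-based page."""
    return "{}[{}]".format(pdf_path, page_index)


def convert_pdf_page_by_page(pdf_path, out_dir, n_pages, resolution=300):
    # Imported here so page_spec() can be used without wand installed.
    from wand.color import Color
    from wand.image import Image

    for i in range(n_pages):
        # Open page i only; the with-block frees it before the next page.
        with Image(filename=page_spec(pdf_path, i), resolution=resolution) as img:
            img.format = 'png'
            img.background_color = Color('white')
            img.alpha_channel = 'remove'
            img.save(filename=os.path.join(out_dir, '{}.png'.format(i)))
```

Called as `convert_pdf_page_by_page('in.pdf', 'out', n_pages=21)`, this would write `out/0.png` to `out/20.png` one at a time, at the cost of starting a fresh render for every page.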

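You could try pyvips. It renders PDFs incrementally by calling libpoppler directly, so there are no processes being started and stopped and no temporary files.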

Example:

#!/usr/bin/python3

import sys
import pyvips

def convert_pdf(filename, resolution=300):
    # n is number of pages to load, -1 means load all pages
    all_pages = pyvips.Image.new_from_file(filename, dpi=resolution, n=-1, \
            access="sequential")

    # That'll be RGBA ... flatten out the alpha
    all_pages = all_pages.flatten(background=255)

    # the PDF is loaded as a very tall, thin image, with the pages joined
    # top-to-bottom ... we loop down the image cutting out each page
    n_pages = all_pages.get("n-pages")
    page_width = all_pages.width
    # integer division keeps the crop coordinates whole numbers
    page_height = all_pages.height // n_pages

    for i in range(0, n_pages):
        page = all_pages.crop(0, i * page_height, page_width, page_height) 
        print("writing {}.tif ..".format(i))
        page.write_to_file("{}.tif".format(i))

convert_pdf(sys.argv[1])
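The slicing arithmetic in that loop can be checked on its own, without pyvips. This is a minimal sketch that assumes every page has the same size, as the loop above does; `page_boxes` is a hypothetical helper, and its tuples are in the same (left, top, width, height) order that `crop` takes.

```python
def page_boxes(width, height, n_pages):
    """Return (left, top, width, height) for each page of a stacked image."""
    if n_pages <= 0 or height % n_pages != 0:
        raise ValueError("n_pages must be positive and divide height evenly")
    page_height = height // n_pages
    # Pages are joined top-to-bottom, so only the top offset changes.
    return [(0, i * page_height, width, page_height) for i in range(n_pages)]
```

For a three-page US Letter document at 300 dpi the stacked image is 2550 x 9900 pixels, and `page_boxes(2550, 9900, 3)` gives tops of 0, 3300 and 6600.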

On this 2015 laptop with this huge PDF, I see:

$ /usr/bin/time -f %M:%e ../pages.py ~/pics/Audi_US\ R8_2017-2.pdf 
writing 0.tif ..
writing 1.tif ..
....
writing 20.tif ..
720788:35.95

So 35 seconds to render the whole document at 300 dpi, with a peak memory use of 720 MB.
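In the `/usr/bin/time` format string, `%M` is the maximum resident set size in kilobytes and `%e` is the elapsed time in seconds. If you would rather log peak memory from inside the script, something like the following should work on Unix. `peak_rss_kb` is a hypothetical helper built on the standard `resource` module.

```python
import resource
import sys


def peak_rss_kb():
    """Peak resident set size of this process, in kilobytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    return peak // 1024 if sys.platform == "darwin" else peak
```

This counts the Python process only. A GhostScript child started by wand would not show up here; `resource.RUSAGE_CHILDREN` reports children that have already exited.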

Python / wand code causes "Killed" when converting large PDFs
