Question

I am trying to convert a PDF to an image so that I can run Tesseract on it. The conversion works when I do it from cmd:

magick convert a.pdf b.png

But it does not work when I try to do the same thing in Python:

from wand.image import Image
with Image(filename='a.pdf') as img:
    img.save(filename='sample.png')

The error I get is:

unable to read image data D:/Users/UserName/AppData/Local/Temp/magick-4908Cq41DDA5FxlX1 @ error/pnm.c/ReadPNMImage/1346

I have also installed Ghostscript, but the error persists.

Edit:

I took the code provided in the answer below and modified it to read all pages. The original problem persists. The code below uses pdf2image:

from pdf2image import convert_from_path
import os
pdf_dir = "D:/Users/UserName/Desktop/scraping"
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_name = pdf_file[:-4]

        for page in pages:
            page.save("%s-page%d.jpg" % (pdf_name, pages.index(page)), "JPEG")

Answer 1

You can use pdf2image instead of wand.image. Install it like this:

pip install pdf2image
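
Note that pdf2image is a wrapper around the Poppler command-line tools (pdftoppm and pdftocairo), so those have to be installed separately. On Windows that usually means downloading a Poppler build and adding its bin folder to PATH. A minimal sketch for checking this from Python follows; the helper name poppler_available is only illustrative and not part of pdf2image:

```python
import shutil


def poppler_available(tool="pdftoppm"):
    """Return True if the given Poppler tool can be found on PATH."""
    # shutil.which returns the full path of the executable, or None if absent
    return shutil.which(tool) is not None
```

If poppler_available() returns False, either fix PATH or pass the folder containing the Poppler binaries to convert_from_path through its poppler_path argument.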

Here is code that loops over every page of the PDF and converts each one to a JPEG:

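import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'
save_dir = 'dir'
os.makedirs(save_dir, exist_ok=True)  # make sure the output folder exists

base_name = os.path.splitext(os.path.basename(filename))[0]

with tempfile.TemporaryDirectory() as path:
    # Leave out first_page/last_page so that every page is converted
    images_from_path = convert_from_path(filename, output_folder=path)

    # Save inside the with block: the pages are backed by files in the temp dir
    for number, page in enumerate(images_from_path, start=1):
        # One output file per page, so pages do not overwrite each other
        page.save(os.path.join(save_dir, '%s-page%d.jpg' % (base_name, number)), 'JPEG')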
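
For the folder loop in the question's edit, one thing worth checking is that os.listdir returns bare file names, so convert_from_path(pdf_file, 300) looks for the PDF in the current working directory rather than in pdf_dir. A rough sketch of that loop with the full path joined and the pages numbered with enumerate is below. The helper names (list_pdfs, page_output_path, convert_folder) and the output layout are my own assumptions, not anything pdf2image provides:

```python
import os


def list_pdfs(pdf_dir):
    """Return full paths of the .pdf files in pdf_dir, sorted by name."""
    return sorted(
        os.path.join(pdf_dir, name)
        for name in os.listdir(pdf_dir)
        if name.lower().endswith(".pdf")
    )


def page_output_path(pdf_path, page_number, out_dir):
    """Build '<out_dir>/<pdf name>-page<N>.jpg' for one page."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, "%s-page%d.jpg" % (stem, page_number))


def convert_folder(pdf_dir, out_dir, dpi=300):
    """Convert every page of every PDF in pdf_dir to a JPEG in out_dir."""
    # Imported here so the path helpers above work without pdf2image installed
    from pdf2image import convert_from_path

    os.makedirs(out_dir, exist_ok=True)
    for pdf_path in list_pdfs(pdf_dir):
        pages = convert_from_path(pdf_path, dpi)
        # enumerate gives a stable page number even if two pages look identical
        for number, page in enumerate(pages, start=1):
            page.save(page_output_path(pdf_path, number, out_dir), "JPEG")
```

Calling convert_folder("D:/Users/UserName/Desktop/scraping", "output") would then write files such as output/report-page1.jpg for a PDF named report.pdf.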

Unable to read image data when converting from PDF to image
