Question

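I am programming in Python, but if a tool or library in another language would help me a lot here, I am open to suggestions.

I have a large number of PDF pages in my database, and I am trying to automate processing them so that I can build some image recognition models with them.

These "PDFs" are really just PNG images wrapped in a PDF wrapper (presumably so that they can be read by a PDF reader such as Adobe Acrobat). I need the PDFs in an image format so they can go into the image recognition pipeline. I am assuming they are PNG images because when I save the image from my browser (i.e. right-click and "Save image as"), the resulting file is a PNG file.

After reading this question from 2010 and checking out this blog post from 2007, I concluded that there must be a way to extract the PNG byte array from the PDF instead of re-converting the PDF into a new image. Strangely, I could not find the PNG file header: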

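# Python 3.6

# `file` is assumed to hold the raw bytes of the PDF, e.g. read via open(path, 'rb')
header = bytes([137, 80, 78, 71, 13, 10, 26, 10])
# the resulting header looks like this: b'\x89PNG\r\n\x1a\n'
file.find(header)  # returns -1 when the signature is not present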

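Does this mean the embedded image is not actually a PNG image?

If there is no simple way to extract the embedded image byte array, which tool could I use to automatically convert each PDF file into some image format (preferably JPEG, PNG, or TIFF)?

Edit: I know that tools like ImageMagick already exist for format conversion, but I would really prefer an extraction approach so that I can learn more about these file formats.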
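
One way to investigate this is to look at which stream filters the file declares. A PDF usually stores an image as an image XObject whose stream holds the pixel samples under a `/Filter` such as `/FlateDecode` (zlib/deflate) or `/DCTDecode` (JPEG), rather than as a complete PNG file, so a missing PNG signature is not surprising. The sketch below is only a rough diagnostic: the function name `inspect_pdf_bytes` and the toy byte string are made up for illustration, the toy fragment is not a complete PDF, and the plain regex will miss filters that sit inside compressed object streams.

```python
import re
import zlib

PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])  # b'\x89PNG\r\n\x1a\n'


def inspect_pdf_bytes(data: bytes) -> dict:
    """Report the PNG signature offset (-1 if absent) and the /Filter names seen."""
    # Captures the first filter name of each /Filter entry, with or without "[".
    names = re.findall(rb"/Filter\s*\[?\s*/(\w+)", data)
    return {
        "png_signature_offset": data.find(PNG_SIGNATURE),
        "filters": sorted({name.decode("ascii") for name in names}),
    }


# Toy fragment imitating one image object (hypothetical data, not a full PDF):
# four red RGB pixels, deflate-compressed the way /FlateDecode expects.
compressed = zlib.compress(bytes([255, 0, 0] * 4))
toy = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /XObject /Subtype /Image /Width 2 /Height 2 "
    b"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /FlateDecode "
    b"/Length " + str(len(compressed)).encode("ascii") + b" >>\nstream\n"
    + compressed + b"\nendstream\nendobj\n"
)

report = inspect_pdf_bytes(toy)
print(report)
```

On a real file, the same function could be called on the bytes read with `open(path, 'rb')`; seeing `DCTDecode` would suggest JPEG data, while `FlateDecode` would suggest compressed raw samples.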

Answer 1

pip install pdf2image
pip install pillow
pip install numpy
pip install opencv-python

Then:

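import numpy as np
import cv2
from pdf2image import convert_from_path as read  # pdf2image needs poppler installed

# PDF page as a NumPy array to play around with in OpenCV or PIL
img = np.asarray(read('path to the pdf file')[0])  # first page of the PDF, in RGB order
# OpenCV expects BGR channel order, so convert before saving
img_bgr = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
cv2.imwrite('path to save the image with the file extension', img_bgr)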
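
Note that the approach above renders the page rather than extracting the embedded bytes. If extraction is the goal, the following standard-library sketch shows how raw samples from a Flate-compressed image stream could be rewrapped in a PNG container. It is a simplified illustration under stated assumptions, not a general extractor: it assumes 8-bit RGB samples with no padding and no `/Predictor`, the helper names `_chunk` and `rgb_samples_to_png` are made up, and the stream body is a toy stand-in, since locating the stream in a real PDF is not shown.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, payload, CRC-32 over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def rgb_samples_to_png(width: int, height: int, samples: bytes) -> bytes:
    """Wrap raw 8-bit RGB samples (row-major, no padding) in a PNG container."""
    stride = width * 3
    if len(samples) != stride * height:
        raise ValueError("sample count does not match width * height * 3")
    # PNG puts a filter-type byte before every scanline; 0 means "no filter".
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour),
    # compression 0, filter method 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


# Toy stand-in for the body of a /FlateDecode image stream (2x2 RGB pixels).
stream_body = zlib.compress(bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255]))
png_bytes = rgb_samples_to_png(2, 2, zlib.decompress(stream_body))
```

Writing `png_bytes` to a file opened in `'wb'` mode should give a viewable 2x2 PNG. Streams marked `/DCTDecode` are a different case: their bytes are typically already JPEG data and would not need this rewrapping.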

Extracting an embedded PNG byte stream from a PDF
