Question

如何以原始分辨率和格式从pdf文档中提取所有图像？（意思是提取tiff为tiff，jpeg为jpeg等，无需重新采样）。布局不重要，我不在乎源图像是否位于页面上。

我正在使用python 2.7，但如果需要可以使用3.x.

Answer 1

在Python中使用PyPDF2和Pillow库很简单：

import PyPDF2

from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()

Answer 2

通常在PDF中，图像只是按原样存储。例如，插入了jpg的PDF将在中间某处有一个字节范围，当提取时是一个有效的jpg文件。您可以使用它非常简单地从PDF中提取字节范围。我前段时间写过这篇文章，示例代码为：Extracting JPGs from PDFs。

Answer 3

在Python中使用PyPDF2进行CCITTFaxDecode过滤：

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: Little indian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, lenght
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # Threshholding, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, lenght
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The  CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()

Answer 4

您可以使用模块PyMuPDF。这会将所有图像输出为.png文件，但开箱即用，速度很快。

import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

see here for more resources

Answer 5

Libpoppler附带了一个名为“pdfimages”的工具，它正是这样做的。

（在ubuntu系统上，它位于poppler-utils包中）

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows二进制文件：http://blog.alivate.com.au/poppler-windows/

Answer 6

我从@sylvain的代码开始存在一些缺陷，例如getData的异常NotImplementedError: unsupported filter /DCTDecode，或者代码无法在某些页面中找到图像的事实，因为它们处于比页面更深的层次。

有我的代码：

import PyPDF2

from PIL import Image

import sys
from os import path
import warnings
warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:

        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s"%(abspath[:-4], p, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])



try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()


file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:    
    page0 = file.getPage(p-1)
    recurse(p, page0)

print('%s extracted images'% number)

Answer 7

经过一番搜索后，我发现以下脚本与我的PDF版本配合得非常好。它只处理JPG，但它与我的无保护文件完美配合。也不需要任何外部库。

不要吝啬，这个剧本来自Ned Batchelder，而不是我。 Python3代码：从pdf中提取jpg。快速而肮脏

import sys

with open(sys.argv[1],"rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend

Answer 8

我更喜欢minecart，因为它非常易于使用。下面的代码段显示了如何从pdf中提取图像：

#pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0) # getting a single page

#iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil()  # requires pillow
    display(im)

Answer 9

这是我从2019年开始的版本，该版本以递归方式从PDF获取所有图像并使用PIL读取它们。与Python 2/3兼容。我还发现zlib有时会压缩PDF中的图像，因此我的代码支持解压缩。

#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):

    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
               sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):

        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":

    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print ("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image

Answer 10

我在我的服务器上安装了ImageMagick，然后通过Popen运行了命令行调用：

 #!/usr/bin/python

 import sys
 import os
 import subprocess
 import settings

 IMAGE_PATH = os.path.join(settings.MEDIA_ROOT , 'pdf_input' )

 def extract_images(pdf):
     output = 'temp.png'
     cmd = 'convert ' + os.path.join(IMAGE_PATH, pdf) + ' ' + os.path.join(IMAGE_PATH, output)
     subprocess.Popen(cmd.split(), stderr=subprocess.STDOUT, stdout=subprocess.PIPE)

这将为每个页面创建一个图像，并将它们存储为temp-0.png，temp-1.png .... 如果您获得的图片只包含图片且没有文字，那么这只是“提取”。

Answer 11

更简单的解决方案：

使用poppler-utils包。要安装它，请使用自制软件（自制软件是特定于MacOS的，但您可以在此处找到适用于Widows或Linux的poppler-utils软件包：https://poppler.freedesktop.org/）。下面的第一行代码使用自制软件安装poppler-utils。安装第二行（从命令行运行）后，从PDF文件中提取图像并命名为＆＃34;图像*＆＃34;。要在Python中运行此程序，请使用os或subprocess模块。第三行是使用os模块的代码，下面是子进程的示例（python 3.5或更高版本用于run（）函数）。更多信息：https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/

brew install poppler

pdfimages file.pdf image

import os
os.system('pdfimages file.pdf image')

或

import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)

Answer 12

好几个星期以来，我一直在为此苦苦挣扎，许多答案都帮助了我，但是始终缺少一些东西，显然这里没有人遇到过 jbig2编码图像的问题。 / p>

在我要扫描的一堆PDF中，用jbig2编码的图像非常受欢迎。

据我了解，有很多复印/扫描机可以扫描纸张并将其转换为包含jbig2编码图像的PDF文件。

因此，经过许多天的测试，很久以前就决定寻求dkagedal在这里提出的答案。

这是我在linux上的分步指南：（如果您有其他操作系统，我建议使用 linux docker ，它将变得更加容易。）

第一步：

apt-get install poppler-utils

然后，我能够像这样运行名为pdfimages的命令行工具：

pdfimages -all myfile.pdf ./images_found/

使用上述命令，您将能够提取 myfile.pdf中包含的所有图像，并将它们保存在images_found中（必须先创建images_found）

在列表中，您会找到几种图像类型，即png，jpg，tiff；所有这些都可以通过任何图形工具轻松阅读。

然后，您将拥有一些名为-145.jb2e和-145.jb2g的文件。

这2个文件包含一个用jbig2编码的图像，保存在2个不同的文件中，一个用于标题，另一个用于数据

再次，我迷失了很多天，试图找出如何将这些文件转换为可读的文件，最后我遇到了名为jbig2dec的工具

因此，首先需要安装此魔术工具：

apt-get install jbig2dec

然后您可以运行：

jbig2dec -t png -145.jb2g -145.jb2e

您最终将能够将所有提取的图像转换成有用的东西。

祝你好运！

Answer 13

我对自己的程序进行了此操作，发现最好使用的库是PyMuPDF。它使您可以找出每页上每个图像的“外部参照”编号，并使用它们从PDF中提取原始图像数据。

import fitz
from PIL import Image
import io

filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)

#loads the first page
page = doc.loadPage(0)

#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]

#gets the image as a dict, check docs under extractImage 
baseImage = doc.extractImage(xref)

#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))

#Displays image for good measure
image.show()

不过绝对要检查文档。

Answer 14

PikePDF 可以用很少的代码做到这一点：

from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

extract_to 将根据图像的方式自动选择文件扩展名在 PDF 中编码。

如果需要，您还可以在提取图像时打印有关图像的一些详细信息：

        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")

可以打印类似的东西

Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...

见the docs 您可以对图像执行更多操作，包括在 PDF 文件中替换它们。

Answer 15

您也可以在Ubuntu中使用pdfimages命令。

使用以下命令安装poppler lib。

sudo apt install poppler-utils

sudo apt-get install python-poppler

pdfimages file.pdf image

创建的文件列表为（例如，.pdf中有两个图像）

image-000.png
image-001.png

有效！现在，您可以使用subprocess.run从python运行它。

Answer 16

截至2019年2月，@ sylvain提供的解决方案（至少在我的设置中）在没有进行小幅修改的情况下就无法使用：xObject[obj]['/Filter']不是值，而是列表，因此是为了制作脚本工作中，我不得不按如下方式修改格式检查：

import PyPDF2, traceback

from PIL import Image

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print nPages

for i in range(nPages) :
    print i
    page0 = input1.getPage(i)
    try :
        xObject = page0['/Resources']['/XObject'].getObject()
    except : xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try :
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print '\t', fn
                if '/FlateDecode' in xObject[obj]['/Filter'] :
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else :
                    print 'Unknown format:', xObject[obj]['/Filter']
            except :
                traceback.print_exc()

Answer 17

使用 pyPDF2 阅读帖子后。

使用@sylvain的代码NotImplementedError: unsupported filter /DCTDecode时出现的错误必须来自方法.getData()：@Alex Paramonov通过使用._data来解决。

到目前为止，我只遇到过“ DCTDecode”案例，但是我正在共享改编的代码，其中包括来自不同帖子的评论：从zilb通过@Alex Paramonov，sub_obj['/Filter']是列表，通过@mxl。

希望它可以帮助pyPDF2用户。遵循代码：

    import sys
    import PyPDF2, traceback
    import zlib
    try:
        from PIL import Image
    except ImportError:
        import Image

    pdf_path = 'path_to_your_pdf_file.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
    nPages = input1.getNumPages()

    for i in range(nPages) :
        page0 = input1.getPage(i)

        if '/XObject' in page0['/Resources']:
            try:
                xObject = page0['/Resources']['/XObject'].getObject()
            except :
                xObject = []

            for obj_name in xObject:
                sub_obj = xObject[obj_name]
                if sub_obj['/Subtype'] == '/Image':
                    zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                    if zlib_compressed:
                       sub_obj._data = zlib.decompress(sub_obj._data)

                    size = (sub_obj['/Width'], sub_obj['/Height'])
                    data = sub_obj._data#sub_obj.getData()
                    try :
                        if sub_obj['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                            mode = "CMYK"
                            # will cause errors when saving (might need convert to RGB first)
                        else:
                            mode = "P"

                        fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                        if '/Filter' in sub_obj:
                            if '/FlateDecode' in sub_obj['/Filter']:
                                img = Image.frombytes(mode, size, data)
                                img.save(fn + ".png")
                            elif '/DCTDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jpg", "wb")
                                img.write(data)
                                img.close()
                            elif '/JPXDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jp2", "wb")
                                img.write(data)
                                img.close()
                            elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                                img = open(fn + ".tiff", "wb")
                                img.write(data)
                                img.close()
                            elif '/LZWDecode' in sub_obj['/Filter'] :
                                img = open(fn + ".tif", "wb")
                                img.write(data)
                                img.close()
                            else :
                                print('Unknown format:', sub_obj['/Filter'])
                        else:
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                    except:
                        traceback.print_exc()
        else:
            print("No image found for page %d" % (i + 1))

Answer 18

我在PyPDFTK here中添加了所有这些内容。

我自己的贡献是处理/Indexed文件：

for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space] # pg 262
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))

请注意，当找到/Indexed个文件时，您不能只将/ColorSpace与字符串进行比较，因为它是ArrayObject。因此，我们必须检查数组并检索索引调色板（代码中的lookup）并将其设置在PIL Image对象中，否则它将保持未初始化（零）并且整个图像显示为黑色。

我的第一直觉是将它们保存为GIF（这是一种索引格式），但我的测试结果表明PNG更小，看起来也是一样。

使用福昕阅读器PDF打印机打印到PDF时，我发现了这些类型的图像。

Answer 19

尝试以下代码。它将从pdf中提取所有图像。

    import sys
    import PyPDF2
    from PIL import Image
    pdf=sys.argv[1]
    print(pdf)
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    for x in range(0,input1.numPages):
        xObject=input1.getPage(x)
        xObject = xObject['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                print(size)
                data = xObject[obj]._data
                #print(data)
                print(xObject[obj]['/Filter'])
                if xObject[obj]['/Filter'][0] == '/DCTDecode':
                    img_name=str(x)+".jpg"
                    print(img_name)
                    img = open(img_name, "wb")
                    img.write(data)
                    img.close()
        print(str(x)+" is done")

Answer 20

首先安装pdf2image

点安装pdf2image == 1.14.0

按照下面的代码从PDF中提取页面。

file_path="file path of PDF"
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
    for page in range(1, maxPages, 10):
        pages = convert_from_path(file_path, dpi=300, first_page=page, 
                last_page=min(page+10-1, maxPages))
        for page in pages:
            page.save(image_path+'/' + str(image_counter) + '.png', 'PNG')
            image_counter += 1
else:
    pages = convert_from_path(file_path, 300)
    for i, j in enumerate(pages):
        j.save(image_path+'/' + str(i) + '.png', 'PNG')

希望它可以帮助编码人员根据PDF页面轻松地将PDF文件转换为图像。

在python中从PDF中提取图像而不重新采样？

20 个答案: