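I am using pdfrw and one of its examples to extract the only image in a PDF file and save that image as a PNG or JPEG file.
I find the code hard to follow. Which arguments should I pass to find_objects?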
from pdfrw.objects import PdfDict, PdfArray, PdfName
from pdfrw.pdfwriter import user_fmt


def find_objects(source, valid_types=(PdfName.XObject, None),
                 valid_subtypes=(PdfName.Form, PdfName.Image),
                 no_follow=(PdfName.Parent,),
                 isinstance=isinstance, id=id, sorted=sorted,
                 reversed=reversed, PdfDict=PdfDict):
    '''
        Find all the objects of a particular kind in a document
        or array.  Defaults to looking for Form and Image XObjects.

        This could be done recursively, but some PDFs
        are quite deeply nested, so we do it without
        recursion.

        Note that we don't know exactly where things appear on pages,
        but we aim for a sort order that is (a) mostly in document order,
        and (b) reproducible.  For arrays, objects are processed in
        array order, and for dicts, they are processed in key order.
    '''
    container = (PdfDict, PdfArray)

    # Allow passing a list of pages, or a dict
    if isinstance(source, PdfDict):
        source = [source]
    else:
        source = list(source)

    visited = set()
    source.reverse()
    while source:
        obj = source.pop()
        if not isinstance(obj, container):
            continue
        myid = id(obj)
        if myid in visited:
            continue
        visited.add(myid)
        if isinstance(obj, PdfDict):
            if obj.Type in valid_types and obj.Subtype in valid_subtypes:
                yield obj
            obj = [y for (x, y) in sorted(obj.iteritems())
                   if x not in no_follow]
        else:
            # TODO: This forces resolution of any indirect objects in
            # the array.  It may not be necessary.  Don't know if
            # reversed() does any voodoo underneath the hood.
            # It's cheap enough for now, but might be removeable.
            obj and obj[0]
        source.extend(reversed(obj))


find_objects('target.pdf')
Answer 0 (score: 2)
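I am the pdfrw author, and I have not written the code to do that yet :(.
Usually, when I need to do this, I use inkscape. It works well in command-line mode.
pdfrw can be part of the reverse path. img2pdf.py is a great tool for placing images on PDF pages, and pdfrw can then add those images (once they are in a PDF) to other pages.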
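Edited to add:
pdfrw is useful for extracting images in the sense that it can place all of the images into a new PDF, one image per page. See extract.py in the examples directory.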
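On the arguments asked about in the question: find_objects walks pdfrw objects (a PdfDict, or an iterable such as a list of pages), so a filename string like 'target.pdf' yields nothing. The following is a minimal sketch, assuming pdfrw is installed and that find_objects is importable from pdfrw.findobjs (otherwise the function quoted in the question can be used directly). The helper name list_images is hypothetical, introduced only for illustration.

```python
def list_images(pdf_path):
    """Return the Image XObjects reachable from the pages of pdf_path.

    Hypothetical helper, not part of pdfrw.
    """
    # Imported inside the function so that pdfrw is only required when
    # the helper is actually called.
    from pdfrw import PdfReader
    from pdfrw.objects import PdfName
    # Assumption: find_objects lives in pdfrw.findobjs.
    from pdfrw.findobjs import find_objects

    # Parse the file first; find_objects expects parsed objects, not a path.
    pages = PdfReader(pdf_path).pages

    # Restrict the subtype filter to images only (the default also
    # matches Form XObjects).
    return list(find_objects(pages, valid_subtypes=(PdfName.Image,)))
```

Calling list_images('target.pdf') should return a list of PdfDict objects; attributes such as Width, Height and Filter on each one describe the embedded image. This only locates the image objects and does not decode them into PNG or JPEG files.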
pdfrw cannot (yet???) then extract the images as JPEG, but that is an easy task for inkscape, which even lets you easily crop to the actual image size.
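A minimal sketch of that inkscape step, driven from Python. It assumes Inkscape 1.x, where --export-type, --export-filename and --export-area-drawing are available (0.9x releases used --export-png=FILE instead). It targets PNG rather than JPEG, so a JPEG would need a further conversion with another tool. Both function names are hypothetical.

```python
import shutil
import subprocess


def inkscape_png_command(pdf_path, png_path, inkscape="inkscape"):
    """Build the argument list for exporting a PDF page to PNG.

    Assumes Inkscape 1.x command-line options.
    """
    return [
        inkscape,
        pdf_path,
        "--export-type=png",
        # Crop the export to the bounding box of the drawing, i.e. the
        # actual image size rather than the whole page.
        "--export-area-drawing",
        "--export-filename=" + png_path,
    ]


def export_png(pdf_path, png_path):
    """Run Inkscape, failing early if it is not on PATH."""
    if shutil.which("inkscape") is None:
        raise RuntimeError("inkscape executable not found on PATH")
    subprocess.run(inkscape_png_command(pdf_path, png_path), check=True)
```

For a one-image-per-page PDF produced by extract.py, export_png('extract.target.pdf', 'image.png') would export the first page. The file names here are placeholders.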