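I want to use pyPdf to split a PDF file based on its outline, where each destination in the outline refers to a different page within the PDF.
Example outline:
main --> points to page 1
  sect1 --> points to page 1
  sect2 --> points to page 15
  sect3 --> points to page 22
In pyPdf it is easy to iterate over each page of the document or over each destination in the document's outline; however, I cannot figure out how to get the page number that a destination points to.
Does anybody know how to find the referenced page number for each destination in the outline?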
Answer 0: (score: 7)
I figured it out:
import pyPdf

class Darrell(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):
        # Walk the (possibly nested) outline and record the object id
        # of the page each destination points to.
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        # Walk the page tree depth-first and map each page object id
        # to its zero-based page number.
        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result

# PATH-TO-PDF is a placeholder for the path to your file
pdf = Darrell(open(PATH-TO-PDF, 'rb'))
template = '%-5s %s'
print template % ('page', 'title')
for p,t in sorted([(v,k) for k,v in pdf.getDestinationPageNumbers().iteritems()]):
    print template % (p+1,t)
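To see why the page-tree walk yields zero-based page numbers, here is a minimal sketch that runs the same kind of depth-first traversal on a hand-built tree of plain dicts. The dict layout, the `type`/`kids`/`idnum` keys, and the function name `page_ids_to_numbers` are stand-ins made up for illustration; they are not pyPdf objects or API.

```python
def page_ids_to_numbers(node, result=None, counter=None):
    """Map each child's id to the number of leaf pages seen before it."""
    if result is None:
        result = {}
    if counter is None:
        counter = [0]  # mutable counter shared across recursive calls
    if node["type"] == "Pages":
        for kid in node["kids"]:
            # Record the index before descending, so a leaf page gets
            # its own zero-based position in document order.
            result[kid["idnum"]] = counter[0]
            page_ids_to_numbers(kid, result, counter)
    elif node["type"] == "Page":
        counter[0] += 1
    return result


# Hypothetical page tree: one leaf, a nested node with two leaves, one leaf.
tree = {
    "type": "Pages", "idnum": 1, "kids": [
        {"type": "Page", "idnum": 10},
        {"type": "Pages", "idnum": 2, "kids": [
            {"type": "Page", "idnum": 20},
            {"type": "Page", "idnum": 30},
        ]},
        {"type": "Page", "idnum": 40},
    ],
}

mapping = page_ids_to_numbers(tree)
```

Intermediate nodes also end up in the mapping (keyed to the index of their first page), which is harmless as long as outline destinations point at leaf pages.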
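Answer 1: (score: 1)
Darrell's class can be modified slightly to produce a multi-level table of contents for a PDF (in the manner of pdftoc in the pdftk toolkit).
My modification adds one more parameter to _setup_page_id_to_num, an integer "level" that defaults to 1. Each invocation increments the level. Instead of storing just the page number in the result, we store a pair of page number and level. Code that consumes the returned result has to be adjusted accordingly.
I am using this to implement the "PDF Hacks" browser-based page-at-a-time document viewer, with a sidebar table of contents that reflects the LaTeX section, subsection, etc. bookmarks. I am working on a shared system where pdftk cannot be installed but Python is available.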
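The exact modified code is not shown in this answer, so the following is only a rough sketch of the general idea of carrying a "level" through the recursion. It assumes the level is tracked while walking the nested outline list (where nesting depth is visible), which may differ from the answerer's actual change. `Dest` is a hypothetical stand-in for pyPdf's Destination, and `outline_with_levels` is an invented name.

```python
from collections import namedtuple

# Hypothetical stand-in for a pyPdf Destination (title plus target page).
Dest = namedtuple("Dest", ["title", "page"])


def outline_with_levels(outline, level=1, result=None):
    """Flatten a nested outline into (title, page, level) tuples."""
    if result is None:
        result = []
    for obj in outline:
        if isinstance(obj, list):
            # A nested list holds the children of the preceding entry.
            outline_with_levels(obj, level + 1, result)
        else:
            result.append((obj.title, obj.page, level))
    return result


outline = [
    Dest("main", 1),
    [Dest("sect1", 1), Dest("sect2", 15), [Dest("sub2a", 16)]],
    Dest("appendix", 22),
]
toc = outline_with_levels(outline)
```

Returning a list instead of a dict keeps the outline order, which is what a sidebar table of contents would need.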
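Answer 2: (score: 0)
This is exactly what I was looking for. Darrell's additions to PdfFileReader should be part of PyPDF2.
I wrote a small recipe that uses PyPDF2 and sejda-console to split a PDF by bookmarks. In my case there are several level-1 sections that I want to keep together. This script lets me do that and gives the resulting files meaningful names.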
import operator
import os
import subprocess
import sys
import time

import PyPDF2 as pyPdf

# need to have sejda-console installed
# change this to point to your installation
sejda = 'C:\\sejda-console-1.0.0.M2\\bin\\sejda-console.bat'

class Darrell(pyPdf.PdfFileReader):
    ...

if __name__ == '__main__':
    t0 = time.time()

    # get the name of the file to split as a command line arg
    pdfname = sys.argv[1]

    # open up the pdf
    pdf = Darrell(open(pdfname, 'rb'))

    # build list of (pagenumbers, newFileNames)
    splitlist = [(1,'FrontMatter')] # Customize name of first section

    template = '%-5s %s'
    print template % ('Page', 'Title')
    print '-'*72
    for t,p in sorted(pdf.getDestinationPageNumbers().iteritems(),
                      key=operator.itemgetter(1)):

        # Customize this to get it to split where you want
        if t.startswith('Chapter') or \
           t.startswith('Preface') or \
           t.startswith('References'):

            print template % (p+1, t)

            # this customizes how files are renamed
            new = t.replace('Chapter ', 'Chapter')\
                   .replace(': ', '-')\
                   .replace(': ', '-')\
                   .replace(' ', '_')
            splitlist.append((p+1, new))

    # call sejda tools and split document
    call = sejda
    call += ' splitbypages'
    call += ' -f "%s"'%pdfname
    call += ' -o ./'
    call += ' -n '
    call += ' '.join([str(p) for p,t in splitlist[1:]])
    print '\n', call
    subprocess.call(call)
    print '\nsejda-console has completed.\n\n'

    # rename the split files
    for p,t in splitlist:
        old ='./%i_'%p + pdfname
        new = './' + t + '.pdf'
        print 'renaming "%s"\n to "%s"...'%(old, new),

        # remove any stale output file so the rename cannot collide
        try:
            os.remove(new)
        except OSError:
            pass

        try:
            os.rename(old, new)
            print' succeeded.\n'
        except OSError:
            print' failed.\n'

    print '\ndone. Splitting took %.2f seconds'%(time.time() - t0)
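The script above hands sejda only the pages at which to split. If you wanted explicit page ranges per output file instead (for example to do the splitting with another tool), something along these lines might work. This is a sketch under assumptions: `page_ranges` and the sample `splitlist` values are invented for illustration, page numbers are 1-based as in the script, and the total page count is passed in by the caller.

```python
def page_ranges(starts, total_pages):
    """Turn (first_page, name) pairs into (name, first_page, last_page)."""
    ordered = sorted(starts)
    ranges = []
    for i, (first, name) in enumerate(ordered):
        if i + 1 < len(ordered):
            # A section ends on the page before the next section starts.
            last = ordered[i + 1][0] - 1
        else:
            last = total_pages
        ranges.append((name, first, last))
    return ranges


# Hypothetical split points in the same shape as the script's splitlist.
splitlist = [(1, "FrontMatter"), (15, "Chapter1-Intro"), (22, "References")]
ranges = page_ranges(splitlist, 30)
```

Note that two sections starting on the same page would produce a range whose last page precedes its first, so real input may need de-duplication first.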
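Answer 3: (score: 0)
A small update to @darrell's class so that it can parse UTF-8 outlines. I am posting it as an answer because code in a comment would be hard to read.
The problem is that pyPdf.pdf.Destination.title may come back as either of two types:
pyPdf.generic.TextStringObject
pyPdf.generic.ByteStringObject
As a result, the output of _setup_outline_page_ids() also contains two different types for the title object, and the code fails with UnicodeDecodeError if an outline title contains anything other than ASCII.
I added this code to solve the problem:
if isinstance(title, pyPdf.generic.TextStringObject):
    title = title.encode('utf-8')
The full class: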
class PdfOutline(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):

        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            if isinstance(title, pyPdf.generic.TextStringObject):
                title = title.encode('utf-8')
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result
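The idea behind the fix, normalizing every title to one string type before using it as a dictionary key, can be shown without pyPdf. The sketch below uses plain text and byte strings as stand-ins for TextStringObject and ByteStringObject; `to_utf8_bytes` is a name invented for illustration, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
def to_utf8_bytes(title):
    """Return the title as UTF-8 encoded bytes, whatever type it came in as."""
    if isinstance(title, bytes):
        # Already a byte string: assume it is usable as-is.
        return title
    # Text (unicode) string: encode explicitly instead of relying on
    # an implicit ASCII conversion.
    return title.encode("utf-8")


# One text title with a non-ASCII character, one byte-string title.
titles = [u"Pr\u00e9face", b"Chapter 1"]
normalized = [to_utf8_bytes(t) for t in titles]
```

With every key in the same type, sorting and formatting the results no longer mixes text and byte strings.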