Question

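I am trying to read a PDF document in Python with the PyPDF2 package. The goal is to read all the bookmarks in the PDF and build a dictionary with each bookmark's page number as the key and its title as the value.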

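Apart from this article, there is not much help on the internet on how to do this. The code posted in it does not work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines that gives you a list of all bookmark objects, but the bookmarks carry no page numbers, and traversing the list is somewhat difficult because there are no parent/child relationships between bookmarks.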

I am sharing my code below, which reads the PDF document and inspects the outlines property.

import PyPDF2

pdfObj = open('SomeDocument.pdf', 'rb')
readerObj = PyPDF2.PdfFileReader(pdfObj)

print(readerObj.numPages)
print(readerObj.outlines[1][1])

Answer 1

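The parent/child relationships are preserved by nesting the lists inside one another. This sample code displays the bookmarks recursively as an indented table of contents: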

import PyPDF2


def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)


reader = PyPDF2.PdfFileReader("[your filename]")

show_tree(reader.getOutlines())

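I don't know how to retrieve the page numbers. I tried a few files, and the page attribute of a Destination object is always an instance of IndirectObject, which does not seem to contain any information about the page number.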

Update:

There is a getDestinationPageNumber method that gets the page number from a Destination object. Here is the code modified to create the dictionary you want:

import PyPDF2


def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            result[reader.getDestinationPageNumber(item)] = item.title
    return result


reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))

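Note, however, that if several bookmarks point to the same page you will overwrite and lose some values, because dictionary keys must be unique.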
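
One way around the overwriting is to keep a list of titles per page instead of a single title. The following is only a sketch: the function name bookmarks_by_page and the page_number_of parameter are made up for illustration, and the stand-in bookmark objects replace a real PDF so the example runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def bookmarks_by_page(bookmark_list, page_number_of):
    """Map each page number to a list of all bookmark titles on that page.

    page_number_of is a callable that returns the page number for one
    bookmark (hypothetical parameter, e.g. reader.getDestinationPageNumber).
    """
    result = defaultdict(list)
    for item in bookmark_list:
        if isinstance(item, list):
            # a nested list holds child bookmarks; merge their titles in
            for page, titles in bookmarks_by_page(item, page_number_of).items():
                result[page].extend(titles)
        else:
            result[page_number_of(item)].append(item.title)
    return dict(result)


# Stand-in bookmarks for illustration only (not real PyPDF2 objects)
demo = [
    SimpleNamespace(title="Chapter 1", page=0),
    [
        SimpleNamespace(title="1.1", page=0),
        SimpleNamespace(title="1.2", page=2),
    ],
    SimpleNamespace(title="Chapter 2", page=2),
]
grouped = bookmarks_by_page(demo, lambda b: b.page)
print(grouped)
```

With a real file the call would presumably look like bookmarks_by_page(reader.getOutlines(), reader.getDestinationPageNumber).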

Answer 2

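@myrmica gives the correct answer. The function needs some additional error handling to cope with bookmarks that are defective. I also added 1 to the page numbers because they are zero-based.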

import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
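            try:
                # getDestinationPageNumber is zero-based, so add 1
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip bookmarks whose destination cannot be resolved
                pass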
    return result

reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))
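
Silently skipping defective bookmarks makes it hard to tell what was dropped. A possible variant, sketched here under assumptions, returns the skipped titles alongside the dictionary. The names bookmark_dict_with_errors, page_number_of and fake_page_number are invented for this example, and the stand-in objects replace a real PDF.

```python
from types import SimpleNamespace


def bookmark_dict_with_errors(bookmark_list, page_number_of):
    """Return (page -> title dict with 1-based pages, list of skipped titles)."""
    result, skipped = {}, []
    for item in bookmark_list:
        if isinstance(item, list):
            # recurse into child bookmarks and merge both outputs
            sub_result, sub_skipped = bookmark_dict_with_errors(item, page_number_of)
            result.update(sub_result)
            skipped.extend(sub_skipped)
        else:
            try:
                result[page_number_of(item) + 1] = item.title
            except Exception:
                # remember the title instead of dropping it silently
                skipped.append(item.title)
    return result, skipped


def fake_page_number(bookmark):
    # Stand-in for reader.getDestinationPageNumber; fails for broken bookmarks
    if bookmark.page is None:
        raise ValueError("bookmark has no resolvable destination")
    return bookmark.page


# Stand-in bookmarks for illustration only (not real PyPDF2 objects)
demo_bookmarks = [
    SimpleNamespace(title="A", page=0),
    [SimpleNamespace(title="B", page=None)],
    SimpleNamespace(title="C", page=4),
]
pages, skipped_titles = bookmark_dict_with_errors(demo_bookmarks, fake_page_number)
print(pages, skipped_titles)
```

As in the answers above, several bookmarks on one page still overwrite each other in this version.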

Answer 3

PyPDF2 is dead. Here is how to do it with PyMuPDF and type annotations:

from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.get_toc()  # [[lvl, title, page], ...]; older PyMuPDF releases call this getToC()
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
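
The dictionary above throws away the level field of each entry. If the hierarchy matters, the same [lvl, title, page] list can be rendered as an indented outline instead. This is a rough sketch: format_toc is a made-up helper, the sample list is hand-written rather than read from a PDF, and it assumes levels start at 1.

```python
from typing import List, Sequence


def format_toc(toc: Sequence[Sequence]) -> List[str]:
    """Render ToC entries as indented "title (page N)" lines."""
    lines = []
    for entry in toc:
        # take only the first three fields so longer entries also work
        level, title, page = entry[:3]
        indent = " " * 4 * (level - 1)  # assumption: top level is 1
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Hand-written sample in the [lvl, title, page] shape, for illustration only
sample_toc = [[1, "Intro", 1], [2, "Scope", 1], [1, "Methods", 3]]
toc_lines = format_toc(sample_toc)
print("\n".join(toc_lines))
```

Because every entry becomes its own line, two bookmarks on the same page are both kept.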

Read all bookmarks from a PDF document and create a dictionary with the page number and title of each bookmark