Question

I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.

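I'm currently trying pyPdf, which does everything I need except provide a way to get page sizes. I understand that I will probably have to iterate, since page sizes can vary within a PDF document. Is there another library or method I could use?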

I tried PIL; some online recipes even use d = Image(imagefilename), but it never reads any of my PDFs. It reads everything else I throw at it, even some things I didn't know PIL could handle.

Any guidance is appreciated. I'm on Windows 7 64-bit with Python 2.5 (because I also do GAE stuff), but I'm happy to do this on Linux or with a more modern Python.

Answer 1

This can be done with PyPDF2:

>>> from PyPDF2 import PdfFileReader
>>> input1 = PdfFileReader(open('example.pdf', 'rb'))
>>> input1.getPage(0).mediaBox
RectangleObject([0, 0, 612, 792])

(PyPDF2 was formerly known as pyPdf, and still refers to its documentation.)
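The box in the example above, 612 x 792 points, corresponds to US Letter. As a rough sketch, a small lookup can put a name on a size once the width and height are known. The helper `guess_paper_name`, the `PAPER_SIZES_PT` table and the 1-point tolerance are illustrative assumptions, not part of PyPDF2.

```python
# Common paper sizes in points (width, height), portrait orientation.
# 1 point = 1/72 inch; A-series values are rounded to two decimals.
PAPER_SIZES_PT = {
    "Letter": (612.0, 792.0),
    "Legal": (612.0, 1008.0),
    "A4": (595.28, 841.89),
    "A3": (841.89, 1190.55),
}


def guess_paper_name(width_pt, height_pt, tolerance=1.0):
    """Return the name of a matching paper size, or None if nothing matches.

    Orientation is ignored: the shorter side is compared with the table's
    width and the longer side with its height.
    """
    short, long_ = sorted((float(width_pt), float(height_pt)))
    for name, (w, h) in PAPER_SIZES_PT.items():
        if abs(short - w) <= tolerance and abs(long_ - h) <= tolerance:
            return name
    return None


print(guess_paper_name(612, 792))
```

The tolerance is there because different tools round the same physical size slightly differently (595 vs. 595.2756 for A4, for example).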

Answer 2

With pdfminer on Python 3.x (pdfminer.six); I have not tried this on Python 2.7:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

pdfPath = 'example.pdf'  # placeholder path, replace with your own file
parser = PDFParser(open(pdfPath, 'rb'))
doc = PDFDocument(parser)
pageSizesList = []
for page in PDFPage.create_pages(doc):
    print(page.mediabox)  # media box = page size as [x0, y0, x1, y1], in points
    pageSizesList.append(page.mediabox)  # pageSizesList ends up with one box per page
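Since page sizes may differ within one document, it can help to see which pages share a size. The following is a minimal sketch, assuming a list of [x0, y0, x1, y1] boxes such as pageSizesList; the function name `group_pages_by_size` and the sample data are made up for illustration and are not part of pdfminer.

```python
from collections import defaultdict


def group_pages_by_size(media_boxes, ndigits=2):
    """Map (width, height) in points to the 1-based page numbers of that size."""
    groups = defaultdict(list)
    for page_number, box in enumerate(media_boxes, start=1):
        x0, y0, x1, y1 = (float(v) for v in box)
        # Round so that tiny floating-point differences do not split a group.
        size = (round(abs(x1 - x0), ndigits), round(abs(y1 - y0), ndigits))
        groups[size].append(page_number)
    return dict(groups)


# Hypothetical sample data standing in for pageSizesList.
sample_boxes = [[0, 0, 612, 792], [0, 0, 612, 792], [0, 0, 595.28, 841.89]]
for size, pages in group_pages_by_size(sample_boxes).items():
    print(size, pages)
```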

Answer 3

Using pdfrw:

>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']

Lengths are given in points (1 pt = 1/72 inch), and the values are returned as strings. The format is ['0', '0', width, height] (thanks, Astrophe!).
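Because the entries come back as strings, they need converting before any arithmetic. Here is a small sketch of that conversion plus the point-to-inch and point-to-millimetre arithmetic; the helper names are hypothetical and not part of pdfrw, and the box is treated as [x0, y0, x1, y1] so that a non-zero origin is also handled.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def box_dimensions(media_box):
    """Return (width, height) in points from an [x0, y0, x1, y1] box.

    Entries may be numbers or numeric strings.
    """
    x0, y0, x1, y1 = (float(v) for v in media_box)
    return abs(x1 - x0), abs(y1 - y0)


def points_to_inches(points):
    return points / POINTS_PER_INCH


def points_to_mm(points):
    return points / POINTS_PER_INCH * MM_PER_INCH


width_pt, height_pt = box_dimensions(['0', '0', '595.2756', '841.8898'])
print(round(points_to_mm(width_pt)), round(points_to_mm(height_pt)))
```

For the box shown above this comes out at about 210 mm by 297 mm, which is A4.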

Answer 4

Another option is to use popplerqt4:

import popplerqt4

doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()  # size of the first page
h = qsizedoc.height()  # given in pt, 1 pt = 1/72 in
w = qsizedoc.width()

Answer 5

Using PyMuPDF:

>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc.loadPage(0)
>>> print(page.MediaBox)
Rect(0.0, 0.0, 595.0, 842.0) #format is (0.0, 0.0, width, height) if page is not rotated
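As the comment notes, the box reads as (0, 0, width, height) only for an unrotated page. A PDF page can carry a rotation that is a multiple of 90 degrees, in which case the displayed width and height are swapped for 90 and 270. A minimal sketch of that adjustment follows; `displayed_size` is a hypothetical helper, not a PyMuPDF function, and the rotation value is assumed to have been read from the page separately.

```python
def displayed_size(media_box, rotation=0):
    """Return (width, height) in points as the page appears once rotated.

    media_box is (x0, y0, x1, y1); rotation is in degrees and is expected
    to be a multiple of 90.
    """
    x0, y0, x1, y1 = (float(v) for v in media_box)
    width, height = abs(x1 - x0), abs(y1 - y0)
    if rotation % 360 in (90, 270):
        # A quarter turn swaps the visible width and height.
        return height, width
    return width, height


print(displayed_size((0.0, 0.0, 595.0, 842.0), rotation=90))
```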

Extracting page sizes from a PDF in Python
