Question

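I am analyzing NLP conferences. I need to use Python to extract the page count from PDFs that are hosted online. For example, if the PDF source is "https://www.aclweb.org/anthology/E91-1002.pdf", the output should be 6.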

Answer 1

I would scrape the file and then use PyPDF2 to extract the information.

Answer 2

As Darjusch suggested, use PyPDF2.

PdfFileReader does not accept raw bytes, so you need to create a file-like object that is initialized with the bytes of the PDF file.

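import io

import PyPDF2
import requests

# Note: PdfFileReader and numPages are the PyPDF2 1.x/2.x names; PyPDF2 3.0
# removed them in favor of PdfReader and len(reader.pages).
response = requests.get("https://www.aclweb.org/anthology/E91-1002.pdf")
pdf_file = io.BytesIO(response.content)  # wrap the downloaded bytes in a file-like object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
num_pages = pdf_reader.numPages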

Or as a one-liner:

num_pages = PyPDF2.PdfFileReader(io.BytesIO(response.content)).numPages

num_pages is 6.

How do I count the pages of an online PDF in Python?
