Question

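How can I read a PDF in Python? I know that one way is to convert it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best suited for PDF extraction?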

Answer 1

You can use the PyPDF2 package.

# install PyPDF2
pip install PyPDF2

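# import the required module
import PyPDF2

# open the PDF in binary mode; the with block closes it afterwards
with open('example.pdf', 'rb') as file:
    # create a PDF reader object
    # (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
    fileReader = PyPDF2.PdfReader(file)

    # print the number of pages in the PDF file
    # (len(fileReader.pages) replaces the removed numPages attribute)
    print(len(fileReader.pages))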

See the documentation at http://pythonhosted.org/PyPDF2/
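
Since the question is about reading the content itself rather than only counting pages, here is a minimal sketch of pulling text out page by page. It assumes a recent PyPDF2 release that provides `PdfReader`, `reader.pages`, and `page.extract_text()`. The helper names `join_page_texts` and `read_pdf_text` are made up for this example, and `example.pdf` is a placeholder path.

```python
def join_page_texts(page_texts):
    """Join per-page strings under a small page header (hypothetical helper)."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a missing page text (None) as an empty string.
        parts.append(f"--- page {number} ---\n{(text or '').strip()}")
    return "\n".join(parts)


def read_pdf_text(path):
    """Return the text of every page in the PDF at `path` (hypothetical helper)."""
    # Imported here so join_page_texts stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes a recent PyPDF2 release

    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        # Extract inside the with block, while the file is still open.
        return join_page_texts([page.extract_text() for page in reader.pages])


# Usage, assuming example.pdf exists in the working directory:
# print(read_pdf_text("example.pdf"))
```

Scanned PDFs that contain only images have no text layer, so this approach will likely return empty pages for them.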

Answer 2

You can use the textract module in Python.

Textract

Installation

pip install textract

Reading a PDF

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
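
Note that `textract.process` returns bytes rather than a `str`, so the result usually needs decoding before you can work with it as text. A minimal sketch, assuming the output is UTF-8 encoded; `to_text` and `read_with_textract` are hypothetical helper names, not part of textract:

```python
def to_text(raw, encoding="utf-8"):
    """Decode extractor output to str, replacing undecodable bytes."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def read_with_textract(path):
    """Extract text from the PDF at `path` and return it as str."""
    # Imported here so to_text stays usable without textract installed.
    import textract

    raw = textract.process(path, method="pdfminer")
    return to_text(raw)


# Usage, with the same placeholder path as above:
# print(read_with_textract("path/to/pdf/file"))
```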

For more details, see the Textract documentation.

Answer 3

Try PyPDF2.

Here is a good tutorial: https://automatetheboringstuff.com/chapter13/
