Question

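I'm using Python 3.4 and need to extract all the text from a PDF so that I can use it for text processing.

All the answers I have seen suggest options for Python 2.7.

I need something that works in Python 3.4.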

Bonson

Answer 1

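You need to install the PyPDF2 module to work with PDFs in Python 3.4. PyPDF2 cannot extract images, charts, or other media, but it can extract text and return it as a Python string. To install it, run pip install PyPDF2 from the command line. The module name is case-sensitive, so make sure to type the "y" in lowercase and all other letters in uppercase.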

>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf','rb')     #'rb' for read binary mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9)          # page index 9 (getPage counts from zero)
>>> pageObj.extractText()

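The last statement returns all the text available at page index 9 of the 'my_file.pdf' document (the tenth page, because getPage counts from zero).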
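Since the question asks for all of the text rather than a single page, the same calls can be put in a loop. The following is a minimal sketch, not part of the original answer: the helper name extract_all_text is hypothetical, and it only assumes a reader object with the numPages, getPage and extractText members used in the session above.

```python
def extract_all_text(reader):
    """Return the text of every page, joined with newlines.

    `reader` is assumed to behave like the PyPDF2.PdfFileReader used above:
    a `numPages` attribute and a zero-based `getPage(i)` method whose result
    offers `extractText()`.
    """
    page_texts = []
    for index in range(reader.numPages):
        # Pages are numbered from zero, so this visits 0 .. numPages - 1.
        page_texts.append(reader.getPage(index).extractText())
    return "\n".join(page_texts)


# Example usage (assumes PyPDF2 is installed and my_file.pdf exists):
#
#   import PyPDF2
#   with open('my_file.pdf', 'rb') as pdf_file:
#       text = extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

Keeping the file open inside the with block matters, because the reader pulls page data from the file object while extracting.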

Answer 2

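pdfminer.six (https://github.com/pdfminer/pdfminer.six) has also been recommended elsewhere and is intended to support Python 3. I can't personally vouch for it, though, because it failed to install on macOS. (There is an open issue about that, and it seems to be a recent problem, so there may be a quick fix.)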

Answer 3

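To complement @Sarah's answer: PDFMiner is a good choice. I have been using it for quite some time, and so far it has worked very well for extracting text content from PDFs. What I do is create a function that calls the CLI client from pdfminer and then saves the output into a variable (to be used later elsewhere). The Python version I use is 3.6, and the function works well and does the job, so it may work for you too: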

def pdf_to_text(filepath):
    print('Getting text content for {}...'.format(filepath))
    process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout, stderr = process.communicate()

    if process.returncode != 0 or stderr:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(filepath, process.returncode, stdout, stderr))

    return stdout.decode('utf-8')

You will of course have to import the subprocess module: import subprocess
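One detail of the function above: because it passes stderr=subprocess.STDOUT, error output is merged into stdout and the stderr value returned by communicate() is always None. The variant below is a sketch that keeps the two streams separate. It assumes Python 3.5 or later (for subprocess.run), and the helper name run_and_capture is hypothetical; pdf2txt.py is the same PDFMiner script used above.

```python
import subprocess


def run_and_capture(command):
    """Run `command` (a list of arguments) and return its stdout as text.

    Raises OSError with the captured stderr if the exit code is non-zero.
    """
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if result.returncode != 0:
        raise OSError(
            "Command {} failed with code {}:\n{}".format(
                command,
                result.returncode,
                result.stderr.decode("utf-8", "replace"),
            )
        )
    return result.stdout.decode("utf-8")


def pdf_to_text(filepath):
    # Delegates to the PDFMiner command-line script, as in the answer above.
    return run_and_capture(["pdf2txt.py", filepath])
```

Splitting out run_and_capture makes the error handling easy to try with any command before pointing it at a PDF.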

Answer 4

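slate3k is good for extracting text. I have tested it with a few PDF files using Python 3.7.3, and it is much more accurate than PyPDF2, for instance. It is a fork of slate, which is a wrapper for PDFMiner. Here is the code I am using: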

import slate3k as slate

with open('Sample.pdf', 'rb') as f:
    doc = slate.PDF(f)

doc
#prints the full document as a list of strings
#each element of the list is a page in the document

doc[0]
#prints the first page of the document

Credit to this comment on GitHub: https://github.com/mstamy2/PyPDF2/issues/437#issuecomment-400491342
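Because the result is described above as a list with one string per page, it can be turned into a single string for later text processing using only the standard library. This is a rough sketch: the helper name pages_to_text and the choice to collapse whitespace are assumptions for illustration, not something slate3k provides.

```python
import re


def pages_to_text(pages):
    """Join page strings into one string, collapsing whitespace per page."""
    # Replace every run of spaces, tabs and newlines with a single space.
    cleaned = [re.sub(r"\s+", " ", page).strip() for page in pages]
    # Separate pages with a blank line so page boundaries stay visible.
    return "\n\n".join(cleaned)


# Example usage with the `doc` object from the snippet above:
#
#   full_text = pages_to_text(doc)
```

Collapsing whitespace discards the original line layout, so skip that step if line breaks matter for your processing.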

Answer 5

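from pdfreader import SimplePDFViewer   # the class used below must be imported by name

pdfFileObj = open('/tmp/Test-test-test.pdf','rb')
viewer = SimplePDFViewer(pdfFileObj)    # same variable name as the line above
viewer.navigate(1)                      # go to page 1
viewer.render()                         # parse the content of the current page
viewer.canvas.strings                   # text strings found on that page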

Best tool for text extraction from PDF in Python 3.4

5 answers: