Question

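Right now I have a program written in Python that requires you to open a certain .pdf file, press Ctrl+A (select all), then Ctrl+C and Ctrl+V (copy and paste) to put its contents into a .txt file, and only then run the program.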

I would like to know whether there is a way to skip these manual steps and run the program by simply referring to the PDF file in the code.

Something like this:

##does the procedure above and saves it on a notes.txt file##
FILE_NAME = 'notes.pdf'
read_pdf(FILE_NAME,'notes.txt')

Answer 1

Use the slate module, which depends on pdfminer.

Install it:

pip install pdfminer==20131113
pip install https://codeload.github.com/timClicks/slate/zip/master

Use it:

import slate

with open('example.pdf', 'rb') as fp:  # PDFs are binary files, so open in 'rb' mode
    doc = slate.PDF(fp)

print(len(doc))
print(doc[0])

4
This is a test.

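Notes:

The pdfminer release pinned above does not support Python 3.
You need to install slate from the main repository, because the version of slate on PyPI is outdated and is not compatible with the latest changes in pdfminer.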
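
If you are on Python 3, one possible alternative (an assumption on my part, not something the answer above relies on) is the community fork pdfminer.six, installed with pip install pdfminer.six. The sketch below uses its high-level extract_text function to do what the question asks. The names default_txt_path and pdf_to_text_file are hypothetical helpers made up for this illustration.

```python
from pathlib import Path


def default_txt_path(pdf_path):
    """Return the PDF path with its suffix replaced by .txt (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text_file(pdf_path, txt_path=None):
    """Extract the text of pdf_path and write it to txt_path.

    If txt_path is omitted, the text file is placed next to the PDF.
    """
    # Imported lazily so this module can be loaded without pdfminer.six;
    # assumes `pip install pdfminer.six` has been run.
    from pdfminer.high_level import extract_text

    target = Path(txt_path) if txt_path is not None else default_txt_path(pdf_path)
    target.write_text(extract_text(str(pdf_path)), encoding="utf-8")
    return target
```

Calling pdf_to_text_file('notes.pdf') would then write notes.txt in the same folder, assuming notes.pdf contains extractable text rather than scanned images.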

Or use PyPDF2:

Install it:

pip install PyPDF2

Use it:

import PyPDF2

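# PyPDF2 3.x names; older releases used PdfFileReader, getNumPages(), getPage() and extractText()
with open('sample.pdf', 'rb') as fp:
    pdf = PyPDF2.PdfReader(fp)

    print(len(pdf.pages))
    print(pdf.pages[0].extract_text())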

1
This is a sample.
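
To get closer to the read_pdf(FILE_NAME, 'notes.txt') call from the question, the page-by-page extraction can be wrapped in a small function. This is only a sketch: it assumes a recent PyPDF2 release that provides PdfReader, and join_pages is a hypothetical helper introduced here, not part of PyPDF2.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text with newlines; pages with no text become empty lines."""
    return "\n".join(text or "" for text in page_texts)


def read_pdf(pdf_path, txt_path):
    """Extract the text of every page in pdf_path and save it to txt_path."""
    # Imported lazily so join_pages can be used without PyPDF2 installed;
    # assumes a PyPDF2 version that exposes PdfReader.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    text = join_pages(page.extract_text() for page in reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")
    return text
```

With this in place, read_pdf('notes.pdf', 'notes.txt') replaces the manual select-all, copy and paste steps. How clean the result is depends on how the PDF was produced, so it is worth checking the output for a few of your own files.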

Answer 2

You can automate this in several ways, and there are many utilities that can help.

For GUI automation there is a module called pywinauto, but it works on Windows only.

You can use a pure-Python library such as PyPDF2, which has an extractText function, or PDFMiner.

The poppler library also has Python bindings, which can be used to extract text in much the same way as PyPDF2.

You can also call an external program from Python, such as pdftotext from Xpdf.
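
The external-program route could look roughly like the sketch below. It assumes the pdftotext executable is installed and on your PATH. The function names build_pdftotext_command and pdf_to_txt are hypothetical, and -layout is the pdftotext option that tries to preserve the original page layout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for: pdftotext [-layout] PDF-file text-file."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def pdf_to_txt(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext to convert pdf_path into the text file txt_path."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

Passing the arguments as a list instead of a single shell string avoids quoting problems with file names that contain spaces.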

Create a .txt file from a PDF file
