Question

I am using the Python module pdftotext to read PDF files.

import pdftotext

with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page
print(pdf.read(2))

# Or read all the text at once
print(pdf.read_all())

The above is the minimal reproducible example, but in my case pdftotext.PDF has no such methods available, e.g. read_all() or read().

with open("/Users/zachary/Downloads/{}R.pdf".format(i), "rb") as f:
    pdf = pdftotext.PDF(f)

pdf.read_all()

AttributeError                            Traceback (most recent call last)
<ipython-input-58-4676e6ace396> in <module>()
      2     pdf = pdftotext.PDF(f)
      3 
----> 4 pdf.read_all()

AttributeError: 'pdftotext.PDF' object has no attribute 'read_all'

What is wrong?

P.S. The only thing I can do with the pdf instance is

pdf[page_numb] to read each page. That works fine!

Answer 1

You can run pdftotext as follows:


This runs pdftotext as a separate process and stores the result in a variable named text.
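
A minimal sketch of this approach, not necessarily the exact code meant above. It assumes the pdftotext command-line tool (shipped with Poppler) is installed and on PATH. The helper names build_pdftotext_command and pdf_to_text are hypothetical and only serve the illustration.

```python
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list for the pdftotext CLI (hypothetical helper).

    The trailing "-" asks pdftotext to write the text to stdout
    instead of to a .txt file next to the PDF.
    """
    return [binary, "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, binary="pdftotext"):
    """Run pdftotext as a separate process and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8")


# Example call (requires the pdftotext binary and an existing PDF file):
# text = pdf_to_text("lorem_ipsum.pdf")
# print(text)
```

If the binary is not on PATH, its full location can be passed through the binary argument.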

There may be other ways to get it working, but this is what actually put me on the right track.

Hope my answer helps,

Israel

Answer 2

import pdftotext

with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page (pages are zero-indexed)
print(pdf[1])
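
Since the PDF object behaves like a sequence of page strings (it supports len(), indexing and iteration), the missing read() and read_all() can be emulated with two small helpers. This is only a sketch: read_page and read_all are hypothetical names, not part of the pdftotext module, and they work on any sequence of strings.

```python
def read_page(pages, number):
    """Return page `number`, counting from 1 (hypothetical helper)."""
    if not 1 <= number <= len(pages):
        raise IndexError("page number out of range: {}".format(number))
    return pages[number - 1]  # the underlying sequence is zero-indexed


def read_all(pages, separator="\n\n"):
    """Join all pages into one string (hypothetical helper)."""
    return separator.join(pages)


# With pdftotext this would be used as (requires the module and a PDF file):
# with open("lorem_ipsum.pdf", "rb") as f:
#     pdf = pdftotext.PDF(f)
# print(read_page(pdf, 2))
# print(read_all(pdf))
```

Counting from 1 in read_page is a design choice to match how people number pages; plain pdf[i] indexing still starts at 0.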

Python module pdftotext: read_all() method not available
