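I am trying to convert a PDF to text, but I have a problem with the PDFPage class. I searched and found nothing, and I get the error below. I also installed pdfminer.six for Python 3.5, but that did not solve it. Please help.
Code: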
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt
# converts a pdf and yields the text content of each page as a string
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle, codec='utf-8', laparams=LAParams())
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            yield text
            # close open handles
            converter.close()
            fake_file_handle.close()
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/system/anaconda3/lib/python3.6/site-packages/pdfminer/pdfpage.py", line 5, in <module>
    from .pdftypes import PDFObjectNotFound
ImportError: cannot import name 'PDFObjectNotFound'
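Answer 0 (score: 0)
Add the following line at the beginning of the code, call StringIO() instead of io.StringIO() (the code never imports the io module itself), and then try again: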
from io import StringIO
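With that import in place, the function from the question could look roughly like the sketch below. It assumes a working pdfminer.six installation and uses only the classes the question already imports. The pdfminer imports sit inside the function purely as a choice for this sketch, so that a broken installation fails when the function is used rather than when the module is loaded. Closing the handles in a finally block is also an addition, not something the original code did.

```python
from io import StringIO


def extract_text_from_pdf(pdf_path):
    """Yield the extracted text of each page in the PDF at pdf_path."""
    # Assumption for this sketch: import pdfminer lazily so that a broken
    # installation raises on first use instead of at module import time.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    with open(pdf_path, "rb") as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            output = StringIO()
            converter = TextConverter(resource_manager, output, laparams=LAParams())
            try:
                interpreter = PDFPageInterpreter(resource_manager, converter)
                interpreter.process_page(page)
                text = output.getvalue()
            finally:
                # Release the per-page handles even if processing fails.
                converter.close()
                output.close()
            yield text
```

Because it is a generator, nothing is read until it is iterated, for example with `for page_text in extract_text_from_pdf("example.pdf"): print(page_text)`, where `example.pdf` is a placeholder path.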
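Answer 1 (score: 0)
Uninstall pdfminer3k (if it is installed). It installs its modules under the same pdfminer package name as pdfminer.six, so a leftover copy is a likely cause of this ImportError: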
$ pip uninstall pdfminer3k
and install pdfminer.six with the following command:
$ python -m pip install pdfminer.six
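Before and after running those commands, it can help to check which of the competing distributions are actually installed. The sketch below is one way to do that with the standard library's importlib.metadata, which needs Python 3.8 or later (newer than the interpreter in the question). The function name and the list of distribution names checked are assumptions made for this example.

```python
from importlib import metadata  # in the standard library from Python 3.8


def installed_pdfminer_distributions(names=("pdfminer", "pdfminer3k", "pdfminer.six")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# Report what is installed in the current environment.
for name, version in installed_pdfminer_distributions().items():
    print(f"{name}: {version or 'not installed'}")
```

If more than one of these shows a version, the environment probably has conflicting copies of the pdfminer package, and uninstalling all but pdfminer.six is the step this answer describes.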