I am using this code, which I found on the following website: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
Code:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text


if __name__ == '__main__':
    print(extract_text_from_pdf('test.pdf'))
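The code works fine for several PDF files I have tried. However, when I try to extract the text from one particular PDF (test.pdf), it fails.
Edit: if you want to test it, you can download the PDF file here: https://pdfs.semanticscholar.org/9843/df40afbf8d8e4e8a9eb821d2a0a157139e62.pdf
I get this error message: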
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-cbf464387547> in <module>
     28
     29 if __name__ == '__main__':
---> 30     print(extract_text_from_pdf('test.pdf'))

<ipython-input-7-cbf464387547> in extract_text_from_pdf(pdf_path)
     16                                       caching=True,
     17                                       check_extractable=True):
---> 18             page_interpreter.process_page(page)
     19
     20         text = fake_file_handle.getvalue()

~/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py in process_page(self, page)
    840     def process_page(self, page):
    841         log.info('Processing page: %r', page)
--> 842         (x0, y0, x1, y1) = page.mediabox
    843         if page.rotate == 90:
    844             ctm = (0, -1, 1, 0, -y0, x1)

TypeError: cannot unpack non-iterable NoneType object
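PDFMiner is not well documented, and searching Google for this error did not turn up much. Can anyone explain what is going wrong? I suspect the problem is in the file itself, because the same code works with other PDF files.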
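Reading the traceback, the unpacking on line 842 fails because `page.mediabox` is `None` for at least one page. To narrow down which pages are affected, a check along these lines might help. This is only a rough sketch: `has_usable_mediabox` and `find_bad_pages` are hypothetical helper names, not part of PDFMiner, and the sketch assumes that a healthy `mediabox` is a list or tuple of four values. The stand-in objects below only imitate pages so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def has_usable_mediabox(page):
    """Return True if page.mediabox can be unpacked into four values.

    process_page() in the traceback does `(x0, y0, x1, y1) = page.mediabox`,
    so anything other than a 4-item sequence makes that line fail.
    Assumption: a valid mediabox is a list or tuple of length 4.
    """
    box = getattr(page, "mediabox", None)
    return isinstance(box, (list, tuple)) and len(box) == 4


def find_bad_pages(pages):
    """Return the 1-based numbers of pages whose mediabox is unusable."""
    return [number for number, page in enumerate(pages, start=1)
            if not has_usable_mediabox(page)]


# Stand-in objects for illustration only; with PDFMiner, `pages` would be
# the iterable returned by PDFPage.get_pages(fh, ...) in the code above.
fake_pages = [
    SimpleNamespace(mediabox=[0, 0, 612, 792]),
    SimpleNamespace(mediabox=None),  # reproduces the failing condition
    SimpleNamespace(mediabox=[0, 0, 612, 792]),
]
print(find_bad_pages(fake_pages))  # prints [2]
```

In the extraction loop, the same check could guard the call, for example `if not has_usable_mediabox(page): continue` before `page_interpreter.process_page(page)`. That would only skip the affected pages (and lose their text); it would not explain why the file has no usable media box in the first place.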
Thanks for your answers!