Python: Unable to extract text from a PDF (TypeError)

Asked: 2019-04-24 17:05:42

Tags: python pdf pdfminer

I am using this code, which I found on the following website: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

Code:

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text

if __name__ == '__main__':
    print(extract_text_from_pdf('test.pdf'))

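The code works fine for several PDF files I have tried. However, when I try to extract text from one particular PDF (test.pdf), it fails.

EDIT: If you want to test it, you can download the PDF file here: https://pdfs.semanticscholar.org/9843/df40afbf8d8e4e8a9eb821d2a0a157139e62.pdf

I get this error message: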

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-cbf464387547> in <module>
     28 
     29 if __name__ == '__main__':
---> 30     print(extract_text_from_pdf('test.pdf'))

<ipython-input-7-cbf464387547> in extract_text_from_pdf(pdf_path)
     16                                       caching=True,
     17                                       check_extractable=True):
---> 18             page_interpreter.process_page(page)
     19 
     20         text = fake_file_handle.getvalue()

~/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py in process_page(self, page)
    840     def process_page(self, page):
    841         log.info('Processing page: %r', page)
--> 842         (x0, y0, x1, y1) = page.mediabox
    843         if page.rotate == 90:
    844             ctm = (0, -1, 1, 0, -y0, x1)

TypeError: cannot unpack non-iterable NoneType object
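
Reading the traceback, `page.mediabox` appears to be `None` for at least one page of this file, which is why the tuple unpacking in `process_page` fails. A minimal sketch for confirming which pages are affected is shown here. The names `find_pages_without_mediabox` and `check_pdf` are hypothetical helpers and not part of PDFMiner; only `PDFPage.get_pages` (already used in the code above) is taken from the library.

```python
def find_pages_without_mediabox(pages):
    """Return the 1-based numbers of pages whose `mediabox` is None."""
    return [
        number
        for number, page in enumerate(pages, start=1)
        if getattr(page, "mediabox", None) is None
    ]


def check_pdf(pdf_path):
    """Hypothetical wrapper: run the check on a real file with pdfminer."""
    # Imported here so the helper above stays usable without pdfminer installed.
    from pdfminer.pdfpage import PDFPage

    with open(pdf_path, "rb") as fh:
        pages = PDFPage.get_pages(fh, caching=True, check_extractable=True)
        # The list is built while the file is still open, because
        # get_pages is a generator that reads lazily from fh.
        return find_pages_without_mediabox(pages)
```

Calling `check_pdf('test.pdf')` should return a non-empty list if the assumption about the missing page box is right. If it returns an empty list, the cause lies elsewhere.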

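PDFMiner does not have good documentation, and Google did not turn up much on this problem. Can anyone explain what is going wrong? I think there is something wrong with the file itself, because the code works with other PDF files.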
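
If the file really does contain pages without a usable page box, one possible workaround is to substitute a default box before processing each page. The sketch below is an assumption-laden illustration and not a confirmed fix: `process_pages_with_fallback` is a hypothetical helper, and the default of 612 x 792 points (US Letter) is an arbitrary choice that may not match the real page size, so any layout-dependent output could be off.

```python
def process_pages_with_fallback(pages, process_page,
                                default_mediabox=(0, 0, 612, 792)):
    """Process every page, patching pages whose mediabox is None.

    `process_page` is any callable taking a page, for example
    `page_interpreter.process_page` from the code above.
    Returns the number of pages that needed the default box.
    """
    patched = 0
    for page in pages:
        if getattr(page, "mediabox", None) is None:
            # Assumption: a US Letter box is an acceptable stand-in.
            page.mediabox = default_mediabox
            patched += 1
        process_page(page)
    return patched
```

In `extract_text_from_pdf`, the `for` loop could then be replaced by a single call such as `process_pages_with_fallback(PDFPage.get_pages(fh, caching=True, check_extractable=True), page_interpreter.process_page)`. Skipping the affected pages instead of patching them would be the more conservative variant.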

Thanks for your answers!

0 answers:

No answers yet.