Can I make PyPDF2 fail gracefully when the PDF it is parsing is corrupted?

Asked: 2018-12-09 06:48:53

Tags: python pdf pypdf2

I have a Python application that scrapes hundreds of PDF files from a public website and parses them with the Python library PyPDF2.

Of the hundreds of such files, all of which parse successfully, one is giving me grief. It is 18 pages long. The file name is "bad.pdf". You can see it here.

Here is my code, which parses the document page by page:

$ virtualenv my_env
$ source my_env/bin/activate
(my_env) $ pip install PyPDF2==1.26.0
(my_env) $ python
>>> import PyPDF2
>>> def parse_pdf_doc():
...     pdfFileObj = open('bad.pdf', 'rb')
...     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
...     for curr_page_num in range(pdfReader.numPages):
...         print('curr_page_num = {}'.format(curr_page_num))
...         pageObj = pdfReader.getPage(curr_page_num)
...         print('\tPage Retrieved successfully')
...         page_text = pageObj.extractText()
...         print('\tText extracted successfully')
...

When I run this code, it successfully parses the first nine pages. But on the tenth page (curr_page_num = 9), it hangs. Forever:

>>> parse_pdf_doc()
curr_page_num = 0
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 1
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 2
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 3
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 4
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 5
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 6
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 7
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 8
    Page Retrieved successfully
    Text extracted successfully
curr_page_num = 9
    Page Retrieved successfully
<... hung here forever ...>

What is wrong with page 10? Let's open it in a viewer. Oh, wow: even Google Docs cannot parse page 10. So that page is definitely corrupted in some way:

[Image: screenshot of the viewer failing to display page 10]

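Even so, I still need PyPDF2 to raise an exception or fail in some other way, rather than just going into an infinite loop. It is killing my workflow. How can I get around this corrupted page in the PDF file?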

1 Answer:

Answer 0 (score: 0):

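The template below should give you an idea of how to achieve this. Function_to_extract_text stands for your own text-extraction function.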

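from multiprocessing import Process
from pdfminer.pdfpage import PDFPage  # assumption: PDFPage is pdfminer's class
processTimeout = 20  # seconds allowed per page
with open('bad.pdf', 'rb') as pdfFileObj:
    for page in PDFPage.get_pages(pdfFileObj):
        extractTextProcess = Process(target=Function_to_extract_text, args=(pdfFileObj, page))
        extractTextProcess.start()
        extractTextProcess.join(processTimeout)  # returns after the timeout even if the page hangs
        if extractTextProcess.is_alive():
            extractTextProcess.terminate()  # give up on the hung page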

Open the file using the with keyword so that it is closed automatically (this avoids leaking the file handle).
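
For what it is worth, here is a slightly fuller sketch of the same idea: run the extraction in a child process and turn a hang into an exception the caller can catch. It assumes Python 3 on a Unix-like system where the fork start method is available. The names run_with_timeout, ExtractionTimeout and extract_page_text are made up for this illustration and are not part of PyPDF2; only PdfFileReader, getPage and extractText come from the question's code.

```python
import multiprocessing as mp


class ExtractionTimeout(Exception):
    """Raised when the child process does not finish within the time limit."""


def _worker(conn, func, args):
    # Runs in the child process: send back either the result or the error.
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=20):
    """Run func(*args) in a child process; raise ExtractionTimeout if it hangs."""
    ctx = mp.get_context("fork")  # assumption: fork is available (Unix-like OS)
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    try:
        # poll() waits up to `timeout` seconds for the child to send something.
        if not parent_conn.poll(timeout):
            raise ExtractionTimeout("no result after {} seconds".format(timeout))
        status, payload = parent_conn.recv()
    finally:
        if proc.is_alive():
            proc.terminate()  # kill a hung (or still exiting) child
        proc.join()
        parent_conn.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def extract_page_text(path, page_num):
    # Hypothetical helper using the PyPDF2 1.26.0 calls from the question.
    # The import is local so the rest of the sketch runs without PyPDF2.
    import PyPDF2
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(page_num).extractText()
```

With this in place, the loop from the question could call run_with_timeout(extract_page_text, ('bad.pdf', curr_page_num), timeout=20) inside a try/except ExtractionTimeout block and skip or log the page that hangs. Reopening the file in the child for every page is wasteful but keeps the sketch simple, because only the path and the page number have to cross the process boundary.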