Python pyPdf: problem downloading a PDF

Time: 2018-02-08 15:31:20

Tags: python pdf pypdf

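I am having trouble reading a PDF from the internet into a Python PdfFileReader object.

My code works for the first URL but not for the second, and I don't know how to fix it.

I can see that in the first example the URL points to a .pdf file, whereas for the second URL the PDF is returned as 'application data' in the HTML body.

So I think that might be the problem. Does anyone know how to fix this so that the code also works for the second URL?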
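One way to check this hypothesis is to look at the bytes that actually come back instead of at the URL. The sketch below is only an illustration: `looks_like_pdf` is a hypothetical helper, not part of pyPdf or requests. It relies on the fact that a PDF file begins with the `%PDF-` marker, and it searches the first 1024 bytes on the assumption that a server might put a few stray bytes in front of the marker.

```python
def looks_like_pdf(data, search_window=1024):
    """Return True if the '%PDF-' marker appears near the start of `data`.

    `data` is the raw response body as bytes, e.g. `response.content`
    from requests. `search_window` is an illustrative assumption.
    """
    return b"%PDF-" in data[:search_window]


# Usage inside the question's test() function (not executed here):
#   response = requests.get(url)
#   print(response.headers.get("Content-Type"), looks_like_pdf(response.content))
```

If the helper returns True for both URLs, the body is a PDF in both cases, and the missing .pdf suffix is unlikely to be what breaks the second call.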

from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url,filename):
  response=requests.get(url)
  pdf_file = BytesIO(response.content)
  existing_pdf = PdfFileReader(pdf_file)

  page = existing_pdf.getPage(0)

  output = PdfFileWriter()
  output.addPage(page)

  outputStream = file(filename, "wb")
  output.write(outputStream)
  outputStream.close()


test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf','works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')

This is the stack trace from my second call to the test function:

D:\scripts>test.py
Traceback (most recent call last):
  File "D:\scripts\test.py", line 21, in <module>
    test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
  File "D:\scripts\test.py", line 10, in test
    page = existing_pdf.getPage(0)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
    self._flatten()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
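pyPdf raises this exception when it considers the document encrypted and no decryption key has been set on the reader. A possible workaround, assuming the second PDF is encrypted with an empty user password (nothing in this thread confirms that), is to call `decrypt('')` before reading any page. In the sketch below, `ensure_decrypted` is a hypothetical helper; `isEncrypted` and `decrypt` are the attributes of pyPdf's `PdfFileReader` that it relies on.

```python
def ensure_decrypted(reader, password=""):
    """Try to unlock `reader`; return True if its pages should be readable.

    `reader` is expected to behave like pyPdf's PdfFileReader, i.e. to
    expose `isEncrypted` and `decrypt(password)`.
    """
    if not reader.isEncrypted:
        return True
    # decrypt() returns 0 when the password does not match,
    # 1 for the user password and 2 for the owner password.
    return reader.decrypt(password) != 0


# Usage inside the question's test() function (not executed here):
#   existing_pdf = PdfFileReader(pdf_file)
#   if not ensure_decrypted(existing_pdf):
#       raise ValueError("PDF needs a non-empty password")
#   page = existing_pdf.getPage(0)
```

Note that `decrypt` can itself fail for encryption schemes the library does not support, so this may not help with every file.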

1 answer:

Answer 0 (score: 0)

I found a solution: I imported PyPDF2 instead of pyPdf, so it may be a bug.
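The answer does not include code. As a rough sketch of what the change might look like, the helper below takes the reader and writer classes as arguments, so the same function can be tried with pyPdf and with PyPDF2. It assumes a PyPDF2 release older than 3.0, which still uses the pyPdf names `PdfFileReader`, `PdfFileWriter`, `getPage` and `addPage`. The name `first_page_bytes` is made up for this illustration.

```python
from io import BytesIO


def first_page_bytes(pdf_bytes, reader_cls, writer_cls):
    """Return a new one-page PDF, as bytes, holding page 0 of `pdf_bytes`.

    `reader_cls` and `writer_cls` are expected to follow the
    PdfFileReader / PdfFileWriter interface of pyPdf or PyPDF2 < 3.0.
    """
    reader = reader_cls(BytesIO(pdf_bytes))
    writer = writer_cls()
    writer.addPage(reader.getPage(0))
    buffer = BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


# Usage (needs network access, requests and PyPDF2 < 3.0; not executed here):
#   import requests
#   from PyPDF2 import PdfFileReader, PdfFileWriter
#   url = 'https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057'
#   data = first_page_bytes(requests.get(url).content,
#                           PdfFileReader, PdfFileWriter)
#   with open('crashes.pdf', 'wb') as fh:
#       fh.write(data)
```

Only the import line differs between the two libraries here, which matches the answer's description of the fix.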