PyPDF2 writes corrupted files

Time: 2020-07-08 17:27:35

Tags: python-3.x ubuntu-18.04 pypdf2

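I'm having trouble with PyPDF2, specifically when splitting a file and writing the pieces back out.

On my Ubuntu server I open a PDF, split it into single pages (at most 3), and write each page to the file system before uploading it to S3. No error is raised when the files are written, but the files cannot be opened after downloading them from S3, and, as the code below shows, they cannot be opened on the server either.

Any ideas?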

  from PyPDF2 import PdfFileReader, PdfFileWriter

  inputpdf = PdfFileReader(open(fi, 'rb'))

  print('breaking file into %s pages' % inputpdf.numPages)  # 17 pages

  for i in range(min(3, inputpdf.numPages)):
      output = PdfFileWriter()
      output.addPage(inputpdf.getPage(i))
      new_fi = fi[:-4] + '_page_%s.pdf' % i  # fi = ./deals/temp_files/test_experian.pdf
      with open(new_fi, 'wb') as outputStream:
          output.write(outputStream)  # writes every file without raising
          pdf_check = open(new_fi, 'rb')
          print('opened PDF')
          read_pdf = PdfFileReader(pdf_check)  # error thrown here -> "EOF marker not found"
          print('loaded PDF')
          page_content = read_pdf.getPage(0).extractText()
          print(page_content.encode('utf-8'))

1 answer:

Answer 0 (score: 0)

Cause of the error: the file is read back while it is still open in write mode.

Solution:

 for i in range(min(3, inputpdf.numPages)):
     output = PdfFileWriter()
     output.addPage(inputpdf.getPage(i))
     new_fi = fi[:-4] + '_page_%s.pdf' % i
     with open(new_fi, 'wb') as outputStream:
         output.write(outputStream)
     # The with block has ended, so the output file is flushed and closed here.
     with open(new_fi, 'rb') as pdf_check:
         print('opened PDF')
         read_pdf = PdfFileReader(pdf_check)
         print('loaded PDF')
         page_content = read_pdf.getPage(0).extractText()
         print(page_content.encode('utf-8'))

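By using

with open(new_fi, 'wb') as outputStream

you create a file handle in write mode, and the file is closed only at the end of that with block.
So when you read the file back inside the block, read_pdf raises an error: the file is reopened for reading before the writer has been closed, so part of the output (possibly including the trailing %%EOF marker) may still be sitting in the write buffer rather than on disk.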
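The buffering behaviour can be reproduced without PyPDF2. The following is a minimal sketch using only the standard library: the payload is a stand-in byte string (not a real PDF), and the names read_back, demo and page_0.pdf are hypothetical, chosen only for illustration. It assumes Python's default buffered binary writer, where a small write stays in memory until flush() or close().

```python
import os
import tempfile


def read_back(path):
    """Open `path` with a separate handle and return the bytes currently on disk."""
    with open(path, "rb") as handle:
        return handle.read()


def demo():
    # Stand-in bytes for illustration only; this is not a valid PDF.
    payload = b"%PDF-1.4\n%%EOF\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page_0.pdf")  # hypothetical file name
        with open(path, "wb") as out:
            out.write(payload)
            # The writer is still open: the small payload is still buffered,
            # so a second handle does not see the complete file yet.
            inside = read_back(path)
            out.flush()
            # After an explicit flush the bytes have been handed to the OS.
            flushed = read_back(path)
        # Leaving the with block closed (and therefore flushed) the writer.
        after = read_back(path)
    return payload, inside, flushed, after


if __name__ == "__main__":
    payload, inside, flushed, after = demo()
    print("inside with block:", len(inside), "of", len(payload), "bytes")
    print("after flush():    ", len(flushed), "of", len(payload), "bytes")
    print("after with block: ", len(after), "of", len(payload), "bytes")
```

Reading after the with block, as in the solution above, is the simpler fix. Calling flush() before reading would also make the bytes visible, but it leaves the writer open longer than necessary.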