Question

I have a PDF file that breaks pyPdf: http://tovotu.de/tests/test.pdf

Here is the sample script:

from pyPdf import PdfFileWriter, PdfFileReader

outputPdf = PdfFileWriter()

inpdf = open("test.pdf","rb")
inputPdf = PdfFileReader(inpdf)
for page in inputPdf.pages:
    outputPdf.addPage(page)

with open("output.pdf","wb") as outpdf:
    outputPdf.write(outpdf)

The error output is here: http://pastebin.com/0m38zhjQ

The error is the same when using PyPDF2 from GitHub. pdftk handles this PDF like any other. Note that writing fails, but reading seems to work fine!

Can you at least point me to the exact part of the PDF that causes the error? A workaround would be even better :)

Answer 1

This looks like a bug in PyPDF2. In this section:

if string.startswith(codecs.BOM_UTF16_BE):
    retval = TextStringObject(string.decode("utf-16"))
    retval.autodetect_utf16 = True

It assumes that any string starting with (0xFE, 0xFF) can be decoded as UTF-16. Your file contains a byte string that starts that way but then contains invalid UTF-16.
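To see the failure in isolation, here is a minimal sketch that does not touch pyPdf at all. The byte string is made up for illustration (it is not taken from test.pdf), and the helper name is hypothetical: it starts with the UTF-16 big-endian BOM but continues with a lone high surrogate, which the codec rejects.

```python
import codecs

# Made-up example data (an assumption, not bytes from the real file):
# a UTF-16 BE BOM, then the high surrogate 0xD800 followed by "A"
# instead of the low surrogate that must come next.
bad = codecs.BOM_UTF16_BE + b"\xd8\x00\x00A"
good = codecs.BOM_UTF16_BE + b"\x00A"


def try_decode_utf16(raw):
    """Return the decoded text, or None if raw is not valid UTF-16."""
    try:
        return raw.decode("utf-16")
    except UnicodeDecodeError:
        return None


# Both strings pass the startswith() check used in the snippet above,
# but only one of them actually decodes.
print(bad.startswith(codecs.BOM_UTF16_BE), try_decode_utf16(bad))
print(good.startswith(codecs.BOM_UTF16_BE), try_decode_utf16(good))
```

So the BOM check alone cannot tell text apart from arbitrary binary data that happens to begin with those two bytes.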

The simplest workaround is to comment out the if and unconditionally use the branch marked # This is probably a big performance hit here.
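If you would rather not drop the UTF-16 path entirely, a gentler variant is to attempt the decode and fall back when it fails. The sketch below is standalone and only illustrates the idea: decode_text_or_keep_bytes is a hypothetical helper, not a pyPdf or PyPDF2 function, and returning the raw bytes is an assumed stand-in for whatever the library's other branch does.

```python
import codecs


def decode_text_or_keep_bytes(raw):
    """Hypothetical helper, not part of pyPdf/PyPDF2.

    Decode raw as UTF-16 when it carries a big-endian BOM and is valid;
    otherwise hand the bytes back unchanged.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        try:
            return raw.decode("utf-16")
        except UnicodeDecodeError:
            # The BOM was present but the payload is not valid UTF-16,
            # so treat it as an opaque byte string instead of failing.
            pass
    return raw
```

With this shape, valid UTF-16 strings still come back as text, while a string like the one in your file passes through as bytes instead of raising.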

pyPdf: illegal UTF-16 surrogate
