PDFBox 2.0.4: XFA text errors

Time: 2017-03-09 02:02:58

Tags: pdf pdfbox xfa

I get the following errors when trying to convert a PDF (XFA) to a string. The errors started when I switched from PDFBox 1.8.12 to PDFBox 2.0.4.

Here is the log:

Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 779916
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 780049
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 780074
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 780074
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:951)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:866)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:150)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:274)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:207)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:854)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:772)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)

java.io.IOException: Wrong type of referenced length object COSObject{7, 0}: COSDictionary
    at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:907)
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:949)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:780)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:966)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:870)

I read the migration guide and used load instead of loadNonSeq, since PDFBox now handles that internally.

Any suggestions on how to resolve these errors?

EDIT Error#1 Error#2

EDIT#2 @TilmanHausherr I checked your theory. I opened the file in Sublime, removed the extra whitespace at the beginning, and saved it. I now get the following error:

    org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56)
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075)
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348)
    at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
    at utils.PDFManager.PDFToText(PDFManager.java:280)
    at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56)
    at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:48)
    at processing.Controller.getDocumentType(Controller.java:110)
    at processing.Controller.insertIntoDb(Controller.java:43)
    at Test.main(Test.java:203)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.util.zip.DataFormatException: invalid distance too far back
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64)
    ... 19 more
Mar 09, 2017 11:07:22 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56)
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075)
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348)
    at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
    at utils.PDFManager.PDFToText(PDFManager.java:280)
    at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56)
    at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:49)
    at processing.Controller.getDocumentType(Controller.java:110)
    at processing.Controller.insertIntoDb(Controller.java:43)
    at Test.main(Test.java:203)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.util.zip.DataFormatException: invalid distance too far back
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64)

To test your theory, I opened another file (one that works fine) in Sublime, and it has the same spaces, tabs, and CRs.

Working File

1 answer:

Answer 0 (score: 2)

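As discussed in the comments, the file has whitespace (CR and TAB) before the start of the PDF header. You can remove it with Notepad++ (or any editor that can edit binary files), or (if all your files have this defect) by writing a short piece of code that opens an input stream, reads until it hits "%", and then copies everything from there on to an output stream.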
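The stream-copy workaround could look roughly like the sketch below. It uses only the Java standard library; the class name `PdfHeaderTrimmer` and the method `stripLeadingBytes` are hypothetical names chosen for illustration, not part of PDFBox. It assumes the first "%" in the file is the start of the PDF header, and it works on raw bytes so the binary stream data is never re-encoded, a risk when saving from a text editor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper (not part of PDFBox): drops whatever precedes the PDF header.
public class PdfHeaderTrimmer {

    /**
     * Copies {@code in} to {@code out}, discarding every byte before the first '%'.
     * Assumes the first '%' in the input starts the "%PDF-" header.
     *
     * @return the number of leading bytes that were discarded
     */
    public static long stripLeadingBytes(InputStream in, OutputStream out) throws IOException {
        long skipped = 0;
        int b;
        // Read byte by byte until the first '%' or end of input.
        while ((b = in.read()) != -1 && b != '%') {
            skipped++;
        }
        if (b == -1) {
            throw new IOException("No '%' found; input does not look like a PDF");
        }
        // Keep the '%' itself, then copy the remainder unchanged.
        out.write(b);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return skipped;
    }
}
```

For files on disk, the streams could come from `Files.newInputStream(source)` and `Files.newOutputStream(target)` inside a try-with-resources block; the cleaned copy can then be passed to `PDDocument.load`.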

I have also opened issue PDFBOX-3714.

Update: this has been fixed in 2.0.5, which is now available.