How to fix "PdfArray cannot be cast to PRStream"

Time: 2014-07-08 10:49:20

Tags: java itextsharp itext

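To read PDF files, I use the following code snippet with the iText library. However, for some PDF documents it throws the exception shown below. I don't understand why this exception is thrown for some documents but not for others. And how can I fix it?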

Note: the code below is used to extract text from a PDF, i.e. it is a PDF-to-text converter.

private ArrayList<byte[]> contentOfPdf() {
    PdfReader reader = null;

    PdfDictionary dictionary = null;
    PRIndirectReference reference = null;

    PRStream contentStream = null;
    ArrayList<byte []> byteStream = new ArrayList<byte []>();

    try{
        reader = new PdfReader(this.filename);

        for(int currentPage = 0 ; currentPage <= this.totalPageNumber ; currentPage ++ ) {

            dictionary = reader.getPageN(currentPage);
            reference = (PRIndirectReference) dictionary.get(PdfName.CONTENTS);
/*line 166*/ contentStream = (PRStream) PdfReader.getPdfObject(reference);

            byteStream.add( PdfReader.getStreamBytes(contentStream) );
        }
    } catch(Exception e){
        e.printStackTrace();
    } finally {
        reader.close();
    }

    return byteStream;
}

Exception:

java.lang.ClassCastException: com.itextpdf.text.pdf.PdfArray cannot be cast to com.itextpdf.text.pdf.PRStream
at pdfCrawler.retrieveContentOfPdf(CrawlerTask.java:166)
at pdfCrawler.call(CrawlerTask.java:55)
at pdfCrawler..call(CrawlerTask.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

1 answer:

Answer 0 (score: 1)

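Whenever you walk through a PDF by hand, I strongly recommend keeping a copy of the PDF specification nearby and looking up every key. In your case, if you look up the Contents key, you'll see that it says:

"The value shall be either a single stream or an array of streams."

Your code handles only the single-stream case, which is why it works for some documents and throws the ClassCastException for those whose pages store Contents as an array.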

I'm not a Java person, but the C# code below should be easy to convert to Java and should do what you're looking for:

//Will hold an array of references
PdfArray refs = null;

//If we have an array, use it directly
if (dictionary.Get(PdfName.CONTENTS).IsArray()) {
    refs = dictionary.GetAsArray(PdfName.CONTENTS);
//If we have just a reference, wrap it in a single item array for convenience
} else if (dictionary.Get(PdfName.CONTENTS).IsIndirect()) {
    refs = new PdfArray(dictionary.Get(PdfName.CONTENTS));
//Sanity check, should never happen for conforming PDFs
} else {
    throw new ApplicationException("Unknown CONTENTS types");
}

//Loop through each reference
foreach (var r in refs) {
    //Same code here
    reference = (PRIndirectReference)r;
    contentStream = (PRStream)PdfReader.GetPdfObject(reference);
    byteStream.Add(PdfReader.GetStreamBytes(contentStream));
}
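
For reference, a Java version of the same idea might look like the sketch below. It assumes iText 5 (the com.itextpdf.text.pdf package that appears in the stack trace). The class name PageContentStreams and the helper contentStreams are made up for illustration and are not part of iText. Unlike the C# snippet, it resolves the Contents entry first, so an indirect reference that points to an array is handled too, and a page with no Contents entry simply yields nothing.

```java
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of iText.
public class PageContentStreams {

    // Collects the content streams of one page, whether /Contents is a
    // single stream or an array of streams.
    static List<PRStream> contentStreams(PdfDictionary page) {
        List<PRStream> streams = new ArrayList<PRStream>();

        // getDirectObject follows an indirect reference, so we see the
        // actual stream or array rather than the reference to it.
        PdfObject contents = page.getDirectObject(PdfName.CONTENTS);

        if (contents == null) {
            // No /Contents entry: an empty page, nothing to collect.
            return streams;
        }
        if (contents instanceof PRStream) {
            streams.add((PRStream) contents);
        } else if (contents.isArray()) {
            PdfArray array = (PdfArray) contents;
            for (int i = 0; i < array.size(); i++) {
                // Array items are normally indirect references; resolve each one.
                PdfObject item = array.getDirectObject(i);
                if (item instanceof PRStream) {
                    streams.add((PRStream) item);
                }
            }
        } else {
            // Should not happen for conforming PDFs.
            throw new IllegalStateException(
                    "Unexpected /Contents type: " + contents.getClass().getName());
        }
        return streams;
    }

    // Reads the decoded bytes of every content stream in the file.
    static List<byte[]> contentOfPdf(String filename) throws IOException {
        List<byte[]> byteStream = new ArrayList<byte[]>();
        PdfReader reader = new PdfReader(filename);
        try {
            // iText page numbers start at 1, not 0.
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                for (PRStream stream : contentStreams(reader.getPageN(page))) {
                    byteStream.add(PdfReader.getStreamBytes(stream));
                }
            }
        } finally {
            reader.close();
        }
        return byteStream;
    }
}
```

One caveat if the goal is text extraction: the streams in a Contents array are meant to be read as one concatenated stream, so parsing each byte array on its own can separate an operator from its operands. In iText 5, reader.getPageContent(pageNumber) should give you the already concatenated bytes for a page, which is probably the safer input for a parser.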