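To read PDF files, I use the following snippet with the iText library. For some PDF documents, however, it throws the exception shown below the code. I don't understand why this exception is thrown for some documents but not for others. How can I fix it?
Note: the code below is meant to extract text from a PDF, i.e. a PDF-to-text converter.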
private ArrayList<byte[]> contentOfPdf() {
    PdfReader reader = null;
    PdfDictionary dictionary = null;
    PRIndirectReference reference = null;
    PRStream contentStream = null;
    ArrayList<byte[]> byteStream = new ArrayList<byte[]>();
    try {
        reader = new PdfReader(this.filename);
        for (int currentPage = 0; currentPage <= this.totalPageNumber; currentPage++) {
            dictionary = reader.getPageN(currentPage);
            reference = (PRIndirectReference) dictionary.get(PdfName.CONTENTS);
            /*line 166*/ contentStream = (PRStream) PdfReader.getPdfObject(reference);
            byteStream.add(PdfReader.getStreamBytes(contentStream));
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        reader.close();
    }
    return byteStream;
}
Exception:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfArray cannot be cast to com.itextpdf.text.pdf.PRStream
at pdfCrawler.retrieveContentOfPdf(CrawlerTask.java:166)
at pdfCrawler.call(CrawlerTask.java:55)
at pdfCrawler..call(CrawlerTask.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
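Answer 0 (score: 1)
Whenever you walk through a PDF by hand, I strongly recommend keeping a copy of the PDF specification nearby and looking up every key. In your case, if you look up the CONTENTS key, you'll see it says:
The value shall be either a single stream or an array of streams.
That is why the exception only shows up for some documents: the cast to PRStream succeeds when a page has a single content stream and fails when CONTENTS turns out to be an array.
I'm not a Java person, but the C# code below should be easy to convert to Java and should do what you're looking for: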
//Resolve the CONTENTS entry first; it may be an indirect reference to a stream or to an array
PdfObject contents = PdfReader.GetPdfObject(dictionary.Get(PdfName.CONTENTS));
//Will hold an array of streams or references to streams
PdfArray refs = null;
//If we have an array, use it directly
if (contents != null && contents.IsArray()) {
    refs = (PdfArray)contents;
//If we have just a single stream, wrap it in a single item array for convenience
} else if (contents != null && contents.IsStream()) {
    refs = new PdfArray(contents);
//Sanity check, should never happen for conforming PDFs
} else {
    throw new ApplicationException("Unknown CONTENTS types");
}
//Loop through each entry
foreach (var r in refs) {
    //GetPdfObject resolves indirect references and returns direct objects unchanged
    contentStream = (PRStream)PdfReader.GetPdfObject(r);
    byteStream.Add(PdfReader.GetStreamBytes(contentStream));
}
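For readers who want to see the same idea in Java, here is a minimal, library-free sketch of the normalization step. The types `Obj`, `Stream`, `Array`, and `Ref`, the object table, and the class name `ContentsNormalizer` are hypothetical stand-ins invented for illustration; they are not iText classes. The sketch only shows the logic: resolve the CONTENTS value first, treat a single stream as a one-item array, then resolve each entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in model of PDF objects; these are NOT iText classes. */
public class ContentsNormalizer {

    /** Marker for anything that can appear as a PDF object in this sketch. */
    public interface Obj {}

    /** A stream object carrying its raw bytes. */
    public static final class Stream implements Obj {
        final byte[] bytes;
        public Stream(byte[] bytes) { this.bytes = bytes; }
    }

    /** An array whose items are streams or references to streams. */
    public static final class Array implements Obj {
        final List<Obj> items;
        public Array(Obj... items) { this.items = Arrays.asList(items); }
    }

    /** An indirect reference, resolved through an object table. */
    public static final class Ref implements Obj {
        final int number;
        public Ref(int number) { this.number = number; }
    }

    /** Follows references until a direct object (or null) is reached. */
    public static Obj resolve(Obj obj, Map<Integer, Obj> table) {
        int hops = 0;
        while (obj instanceof Ref) {
            if (hops++ > table.size()) {
                throw new IllegalStateException("Circular reference");
            }
            obj = table.get(((Ref) obj).number);
        }
        return obj;
    }

    /** Returns the bytes of every content stream, whatever shape the Contents value has. */
    public static List<byte[]> contentStreams(Obj contents, Map<Integer, Obj> table) {
        List<byte[]> result = new ArrayList<>();
        Obj direct = resolve(contents, table);
        if (direct == null) {
            return result; // no Contents value: nothing to collect
        }
        // Normalize: treat a single stream as a one-item array
        List<Obj> items = (direct instanceof Array)
                ? ((Array) direct).items
                : Collections.singletonList(direct);
        for (Obj item : items) {
            Obj target = resolve(item, table);
            if (!(target instanceof Stream)) {
                throw new IllegalStateException("Contents entry is not a stream");
            }
            result.add(((Stream) target).bytes);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Obj> table = new HashMap<>();
        table.put(1, new Stream("BT".getBytes(StandardCharsets.US_ASCII)));
        table.put(2, new Stream("ET".getBytes(StandardCharsets.US_ASCII)));
        // The shape behind the exception in the question: a reference that resolves to an array
        table.put(3, new Array(new Ref(1), new Ref(2)));

        System.out.println(contentStreams(new Ref(1), table).size()); // prints 1
        System.out.println(contentStreams(new Ref(3), table).size()); // prints 2
    }
}
```

In real iText code the stand-ins would presumably map onto `PdfReader.getPdfObject(...)` for resolving references and `isArray()` / `isStream()` on the resolved `PdfObject` for the shape check, with `PdfReader.getStreamBytes(...)` reading each stream as in the question.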