I want to extract the text from a PDF file. I am using PDFBox
version 1.8.8, and I get the following error.
2014-12-18 15:02:59 WARN XrefTrailerResolver:203 - Did not found XRef object at specified startxref position 4268142
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
at org.apache.pdfbox.pdmodel.common.COSStreamArray.<init>(COSStreamArray.java:68)
at org.apache.pdfbox.pdmodel.common.PDStream.createFromCOS(PDStream.java:185)
at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:639)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:380)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:288)
at com.algotree.pdf.test.PdfBoxTest.pdftoText(PdfBoxTest.java:53)
at com.algotree.pdf.test.PdfBoxTest.main(PdfBoxTest.java:71)
Yes, I have read many posts about this error, but I still cannot find a way to read this file. Thanks.
Here is my code:
static String pdftoText(String fileName) throws IOException {
    PDFParser parser;
    String parsedText = null;
    PDFTextStripper pdfStripper = new PDFTextStripper();
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File(fileName);
    if (!file.isFile()) {
        System.err.println("File " + fileName + " does not exist.");
        return null;
    }
    try {
        parser = new PDFParser(new FileInputStream(file));
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        return null;
    }
    try {
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdfStripper.setSuppressDuplicateOverlappingText(false);
        pdDoc = new PDDocument(cosDoc);
        int endPage = pdDoc.getPageCount();
        if (endPage > 300)
            endPage = 300;
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(endPage);
        parsedText = pdfStripper.getText(cosDoc);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return parsedText;
}
Answer 0 (score: 1)
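This works. The only real change is that the document is loaded with PDDocument.loadNonSeq(file, null) instead of going through PDFParser directly; in PDFBox 1.8.x the non-sequential parser tends to be more tolerant of files like this one: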
// Imports needed (PDFBox 1.8.x package layout):
// import java.io.File;
// import java.io.IOException;
// import org.apache.pdfbox.pdmodel.PDDocument;
// import org.apache.pdfbox.util.PDFTextStripper;

static String pdftoText(String fileName) throws IOException {
    String parsedText = null;
    PDDocument pdDoc = null;
    File file = new File(fileName);
    if (!file.isFile()) {
        System.err.println("File " + fileName + " does not exist.");
        return null;
    }
    try {
        // Load with the non-sequential parser; null means no scratch file is used.
        pdDoc = PDDocument.loadNonSeq(file, null);
    } catch (IOException e) {
        System.err.println("Unable to load PDF. " + e.getMessage());
        return null;
    }
    try {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // Extract at most the first 300 pages.
        int endPage = pdDoc.getPageCount();
        if (endPage > 300)
            endPage = 300;
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(endPage);
        // Pass the PDDocument itself, not a COSDocument.
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return parsedText;
}
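If you want to check whether the file itself is the problem, the first warning in the log ("Did not found XRef object at specified startxref position") is a useful hint: the offset written after the startxref keyword near the end of the file is supposed to point at the cross-reference section. The sketch below is a rough, PDFBox-independent way to test that. The class name StartXrefCheck and its two helper methods are made up for this illustration, and the check is only a heuristic: it accepts either the xref keyword or something that looks like an object header (as a cross-reference stream would have) at that offset, and it reads the whole file into memory for simplicity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper for this answer; not part of PDFBox.
class StartXrefCheck {

    // A cross-reference stream begins with an object header such as "12 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("^\\d+\\s+\\d+\\s+obj");

    /** Returns the offset written after the last "startxref" keyword, or -1 if there is none. */
    static long declaredStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** Heuristic: true if the bytes at the offset look like an xref table or an object header. */
    static boolean pointsAtXref(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false;
        }
        int from = (int) offset;
        int len = Math.min(32, pdf.length - from);
        String head = new String(pdf, from, len, StandardCharsets.ISO_8859_1);
        return head.startsWith("xref") || OBJ_HEADER.matcher(head).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefCheck <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = declaredStartXref(pdf);
        System.out.println("declared startxref offset: " + offset);
        System.out.println("looks like an xref section there: " + pointsAtXref(pdf, offset));
    }
}
```

If the second line prints false for your file, the startxref pointer is off, which would be consistent with the warning in your log and with the later "No Kids found" errors. It does not prove anything more than that, and a file can still be damaged in other ways when the check passes.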