The PDF document loads, and I get the scanned page content as a BufferedImage. When I run OCR on this image, the result is empty.
The code is pasted below:
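import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public static void main(String[] args) {
    // Test files: mini-cog.pdf, Optometry.pdf
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument document = PDDocument.load(new File("D:\\McLaren\\Optometry.pdf"))) {
        for (PDPage page : document.getPages()) {
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // page has no resources, so no embedded images
            }
            for (COSName name : resources.getXObjectNames()) {
                PDXObject xObject = resources.getXObject(name);
                // Only image XObjects hold the scanned page content
                if (xObject instanceof PDImageXObject) {
                    BufferedImage image = ((PDImageXObject) xObject).getImage();
                    System.out.println("Width ====>> " + image.getWidth());
                    System.out.println("Height ====>> " + image.getHeight());
                    ocr(image);
                }
            }
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

public static void ocr(BufferedImage image) {
    try {
        // Select the KCMS color management module
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        // Load the native gsdll64.dll (Ghostscript) from the Tesseract-OCR folder
        System.load("C:\\Program Files\\Tesseract-OCR\\gsdll64.dll");
        Tesseract tessInst = new Tesseract();
        tessInst.setDatapath("D:\\tesseract\\");
        tessInst.setLanguage("eng");
        String result = tessInst.doOCR(image);
        System.out.println(result); // prints an empty string for my scanned pages
    } catch (TesseractException e) {
        e.printStackTrace();
    }
}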
After running OCR on the BufferedImage to convert it to text, the result is empty.
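To narrow this down, it may help to look at exactly what is being passed to `doOCR`. The following is a minimal debugging sketch, not a confirmed fix: it uses only the JDK, and the class name `OcrPreprocess`, the method names, and the scale factor are hypothetical choices for illustration. It copies the extracted image onto a white 8-bit grayscale canvas (so any transparency or unusual color model is flattened), optionally enlarges it, and writes it to a PNG so the image can be inspected by eye.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper class, for illustration only.
final class OcrPreprocess {

    // Returns an opaque 8-bit grayscale copy of src, enlarged by an integer factor.
    static BufferedImage toGrayscale(BufferedImage src, int scale) {
        if (scale < 1) {
            throw new IllegalArgumentException("scale must be >= 1");
        }
        int w = src.getWidth() * scale;
        int h = src.getHeight() * scale;
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // White background, so transparent pixels do not end up black
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, w, h);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    // Writes the image as a PNG so it can be opened and checked manually.
    static void dump(BufferedImage image, File target) throws IOException {
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the image extracted from the PDF
        BufferedImage sample = new BufferedImage(100, 40, BufferedImage.TYPE_INT_ARGB);
        BufferedImage prepared = toGrayscale(sample, 2);
        File out = File.createTempFile("ocr-input", ".png");
        dump(prepared, out);
        System.out.println("Wrote " + prepared.getWidth() + "x" + prepared.getHeight()
                + " image to " + out);
    }
}
```

In the code from the question, this would be used as `ocr(OcrPreprocess.toGrayscale(image, 2))`, with a call to `dump` just before it. If the dumped PNG looks blank, inverted, or very small, the problem is probably in the extracted image rather than in the Tesseract setup.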