Converting scanned images in a PDF to text with Tesseract OCR

Time: 2019-07-15 10:19:44

Tags: java ocr tesseract

The PDF document loads, and the scanned page content is retrieved as a BufferedImage. When I run OCR on that image, the result comes back empty.

The code is pasted below.

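import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// The wrapper class name is a placeholder; the original post showed only the two methods.
public class PdfImageOcr {

    public static void main(String[] args) {
        // Set once at startup, before PDFBox decodes any image.
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        // Test files tried: mini-cog.pdf, Optometry.pdf
        // try-with-resources closes the document even if an exception is thrown.
        try (PDDocument document = PDDocument.load(new File("D:\\McLaren\\Optometry.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getXObjectNames()) {
                    PDXObject xObject = resources.getXObject(name);
                    // Only image XObjects are passed to OCR.
                    if (xObject instanceof PDImageXObject) {
                        BufferedImage image = ((PDImageXObject) xObject).getImage();
                        System.out.println("Width ====>> " + image.getWidth());
                        System.out.println("Height ====>> " + image.getHeight());
                        ocr(image);
                    }
                }
            }
        } catch (IOException ex) {
            System.out.println("" + ex);
        }
    }

    public static void ocr(BufferedImage image) {
        try {
            System.load("C:\\Program Files\\Tesseract-OCR\\gsdll64.dll");
            Tesseract tessInst = new Tesseract();
            tessInst.setDatapath("D:\\tesseract\\");
            tessInst.setLanguage("eng");
            String result = tessInst.doOCR(image);
            // This is where the empty result shows up.
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}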

After the OCR step converts the image to text, the result for the BufferedImage is empty.
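
An empty OCR result can have several causes, and the post does not settle which one applies here. As a first diagnostic, the following minimal sketch (using only the JDK) checks whether the extracted BufferedImage actually contains dark pixels, and redraws it as 8-bit grayscale before it is handed to doOCR. The class name OcrImageCheck, both method names, and the threshold of 128 are hypothetical and not part of the original code.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.Raster;

/** Hypothetical helper for inspecting an image before OCR (not from the original post). */
public class OcrImageCheck {

    /** Redraws the image onto a white 8-bit grayscale canvas. */
    public static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // White background so transparent areas do not turn black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, gray.getWidth(), gray.getHeight());
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Returns the fraction of pixels whose gray value is below the threshold (0-255). */
    public static double darkFraction(BufferedImage gray, int threshold) {
        Raster raster = gray.getRaster();
        long dark = 0;
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                if (raster.getSample(x, y, 0) < threshold) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) gray.getWidth() * gray.getHeight());
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a scanned page: left half black, right half white.
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        g.setColor(Color.BLACK);
        g.fillRect(0, 0, 100, 100);
        g.dispose();

        System.out.println("Dark fraction: " + darkFraction(toGray(page), 128));
    }
}
```

In the code above, one could print darkFraction(OcrImageCheck.toGray(image), 128) next to the width and height, and pass the grayscale copy to ocr. A value very close to 0 or 1 might suggest that the XObject is blank, inverted, or a mask rather than a readable page, but that is only a hint, not a confirmed explanation.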

0 answers:

No answers yet