Java 8, Tess4J: Optimizing images for OCR with Tesseract

Time: 2017-08-18 10:34:43

Tags: java bitmap tesseract tess4j

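I am working with Tesseract and already have OCR working. I want to optimize the image so that the OCR results are better. Currently I only convert the image to monochrome and scale it to twice its size. Even after that, I still have problems with smaller fonts.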

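I tried looking this up, and here is one of the best answers I could find. Unfortunately, it works with Bitmap, and I could not find any native Bitmap class in Java. There is also an answer with Java code, but it again uses Bitmap and does not say which package it comes from.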

Where does BitmapImageUtil.convertToGrayscale() come from?

Code:

private String testOcr(String fileLocation, int attachId) {
    try {
        File imageFile = new File(fileLocation);
        BufferedImage img = ImageIO.read(imageFile);
        // Random base-32 identifier used as the output file name
        String identifier = new BigInteger(130, random).toString(32);
        String blackAndWhiteImage = previewPath + identifier + ".png";
        File outputfile = new File(blackAndWhiteImage);
        // Convert to grayscale at the original size, then scale to twice the size
        BufferedImage bufferedImage = BitmapImageUtil.convertToGrayscale(img, new Dimension(img.getWidth(), img.getHeight()));
        bufferedImage = Scalr.resize(bufferedImage, img.getWidth() * 2, img.getHeight() * 2);
        ImageIO.write(bufferedImage, "png", outputfile);

        ITesseract instance = Tesseract.getInstance();
        // Point to one folder above tessdata directory, must contain training data
        instance.setDatapath("/usr/share/tesseract-ocr/");
        // ISO 639-3 standard
        instance.setLanguage("deu");
        String result = instance.doOCR(outputfile);
        // result processing with regex.
        // The post is cut off here; the return and catch are assumed so the method compiles.
        return result;
    } catch (IOException | TesseractException e) {
        throw new IllegalStateException("OCR failed for " + fileLocation, e);
    }
}

1 Answer:

Answer 0: (score: 0)

BitmapImageUtil comes from the Apache FOP project ("FOP" = "Formatting Objects Processor").

The package is org.apache.fop.util.bitmap.

Source code for release 2.2 is available here
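
If adding FOP as a dependency only for this one call is undesirable, the same preprocessing (grayscale conversion plus 2x upscaling) can probably be approximated with the JDK alone. The following is a minimal sketch, not the FOP implementation: the class name OcrPreprocess and the method toGrayscaleScaled are hypothetical names chosen for illustration, and bicubic interpolation is an assumption rather than something the question specifies.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

class OcrPreprocess {

    // Hypothetical helper: returns a grayscale copy of src, scaled by an integer factor.
    static BufferedImage toGrayscaleScaled(BufferedImage src, int factor) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        // Drawing into a TYPE_BYTE_GRAY image converts the colors to 8-bit gray.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            // Assumption: bicubic interpolation is used for smoother upscaled glyph edges.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

In the question's method, the result of OcrPreprocess.toGrayscaleScaled(img, 2) could be written with ImageIO.write and passed to doOCR as before. Whether this helps with small fonts depends on the input images, so the scale factor is worth experimenting with.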