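How to detect text blocks and columns in a PDF using tess4j

Time: 2017-02-23 14:14:11

Tags: java ocr tesseract tess4j

I'm new to Tesseract (tess4j) and have managed to use the main features, such as reading text or word positions from an image or PDF, handling rotation, and so on.

What I can't find, and am not sure is even possible, is an easy way to detect text blocks (paragraphs or columns). Also, if the PDF contains other kinds of blocks, such as images, is it possible to get them somehow, or at least get the position (bounding box) of each block?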

2 answers:

Answer 0: (score: 1)

You can use the TessBaseAPIGetComponentImages API method, like this:

Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, TRUE, null, null);

See the Tess4J unit tests for a complete example.

Answer 1: (score: 1)

I have already accepted the answer, but here is the code I ended up with based on it:

public Page recognizeTextBlocks(Path path) {
        log.info("TessBaseAPIGetComponentImages");
        File image = path.toFile();
        Leptonica leptInstance = Leptonica.INSTANCE;
        Pix pix = leptInstance.pixRead(image.getPath());
        if (pix == null) {
            // pixRead returns null when the file cannot be read or decoded
            throw new IllegalArgumentException("Cannot read image: " + path);
        }
        Page blocks = new Page(pix.w, pix.h);
        api.TessBaseAPIInit3(handle, datapath, language);
        api.TessBaseAPISetImage2(handle, pix);
        api.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
        // Only the bounding boxes are needed, so no component images or block ids are requested
        PointerByReference pixa = null;
        PointerByReference blockids = null;
        Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, FALSE, pixa, blockids);
        int boxCount = (boxes == null) ? 0 : leptInstance.boxaGetCount(boxes);
        for (int i = 0; i < boxCount; i++) {
            Box box = leptInstance.boxaGetBox(boxes, i, L_CLONE);
            if (box == null) {
                continue;
            }
            // Restrict recognition to the current block
            api.TessBaseAPISetRectangle(handle, box.x, box.y, box.w, box.h);
            Pointer utf8Text = api.TessBaseAPIGetUTF8Text(handle);
            String ocrResult = null;
            if (utf8Text != null) {
                ocrResult = utf8Text.getString(0);
                api.TessDeleteText(utf8Text);
            }
            // A block with no recognizable text is treated as an image (or other non-text) block
            Block block;
            if (ocrResult == null || ocrResult.trim().isEmpty()) {
                block = new ImageBlock(new Rectangle(box.x, box.y, box.w, box.h));
            } else {
                block = new TextBlock(new Rectangle(box.x, box.y, box.w, box.h), ocrResult);
            }
            blocks.add(block);
            int conf = api.TessBaseAPIMeanTextConf(handle);
            log.debug(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s", i, box.x, box.y, box.w, box.h, conf, ocrResult));

            // release the cloned Box
            PointerByReference boxRef = new PointerByReference();
            boxRef.setValue(box.getPointer());
            leptInstance.boxDestroy(boxRef);
        }

        // release Boxa resource
        if (boxes != null) {
            PointerByReference boxaRef = new PointerByReference();
            boxaRef.setValue(boxes.getPointer());
            leptInstance.boxaDestroy(boxaRef);
        }

        // release Pix resource
        PointerByReference pRef = new PointerByReference();
        pRef.setValue(pix.getPointer());
        leptInstance.pixDestroy(pRef);

        return blocks;
    }
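
The question also asks about columns, which RIL_BLOCK boxes do not identify directly. One possible approach is to group the block rectangles by how much their horizontal extents overlap. The following is only a rough sketch under stated assumptions: ColumnGrouper, groupIntoColumns and the minOverlap threshold are hypothetical names invented for illustration and are not part of tess4j or Tesseract, and the boxes are assumed to be java.awt.Rectangle values in page pixel coordinates.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of tess4j): groups block boxes into columns. */
class ColumnGrouper {

    /**
     * Assigns each box to the first column whose horizontal span it overlaps by at
     * least minOverlap (a fraction of the narrower of the two widths); otherwise
     * the box starts a new column. Columns come out ordered left to right.
     */
    static List<List<Rectangle>> groupIntoColumns(List<Rectangle> boxes, double minOverlap) {
        List<Rectangle> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt((Rectangle r) -> r.x));

        List<List<Rectangle>> columns = new ArrayList<>();
        List<int[]> spans = new ArrayList<>(); // {left, right} extent of each column

        for (Rectangle box : sorted) {
            int target = -1;
            for (int c = 0; c < columns.size(); c++) {
                if (overlapRatio(spans.get(c), box) >= minOverlap) {
                    target = c;
                    break;
                }
            }
            if (target < 0) {
                // No sufficiently overlapping column: start a new one
                columns.add(new ArrayList<>());
                spans.add(new int[] {box.x, box.x + box.width});
                target = columns.size() - 1;
            } else {
                // Widen the column span to cover the new box
                int[] span = spans.get(target);
                span[0] = Math.min(span[0], box.x);
                span[1] = Math.max(span[1], box.x + box.width);
            }
            columns.get(target).add(box);
        }

        // Within each column, order the blocks top to bottom
        for (List<Rectangle> column : columns) {
            column.sort(Comparator.comparingInt((Rectangle r) -> r.y));
        }
        return columns;
    }

    /** Horizontal overlap divided by the narrower of the two widths, in [0, 1]. */
    static double overlapRatio(int[] span, Rectangle box) {
        int overlap = Math.min(span[1], box.x + box.width) - Math.max(span[0], box.x);
        int narrower = Math.min(span[1] - span[0], box.width);
        return narrower <= 0 ? 0.0 : Math.max(0, overlap) / (double) narrower;
    }

    public static void main(String[] args) {
        List<Rectangle> boxes = new ArrayList<>();
        boxes.add(new Rectangle(520, 100, 400, 500)); // right column
        boxes.add(new Rectangle(60, 350, 380, 300));  // left column, lower block
        boxes.add(new Rectangle(50, 100, 400, 200));  // left column, upper block
        System.out.println("columns: " + groupIntoColumns(boxes, 0.5).size());
    }
}
```

A full-width block such as a page title overlaps every column and would merge them under this heuristic, so such blocks would probably need to be filtered out (for example by width relative to the page) before grouping.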

Note: the classes Block, ImageBlock and TextBlock come from my own project and are not part of tess4j or tesseract.
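
Since those project classes are not shown, here is a guess at what a minimal version might look like, inferred only from how they are used in the method above. Everything in it is an assumption: the field names, the getters, the use of java.awt.Rectangle, and the Page container (which is also not a tess4j type) are hypothetical and may differ from the answerer's actual code.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical base type: a page region with a bounding box in pixel coordinates. */
abstract class Block {
    private final Rectangle bounds;

    public Block(Rectangle bounds) {
        this.bounds = bounds;
    }

    public Rectangle getBounds() {
        return bounds;
    }
}

/** A region where OCR produced no text; assumed to be an image or other non-text content. */
class ImageBlock extends Block {
    public ImageBlock(Rectangle bounds) {
        super(bounds);
    }
}

/** A region together with the text recognized inside it. */
class TextBlock extends Block {
    private final String text;

    public TextBlock(Rectangle bounds, String text) {
        super(bounds);
        this.text = text;
    }

    public String getText() {
        return text;
    }
}

/** Hypothetical container: the page size plus the blocks found on it, in detection order. */
class Page {
    private final int width;
    private final int height;
    private final List<Block> blocks = new ArrayList<>();

    public Page(int width, int height) {
        this.width = width;
        this.height = height;
    }

    public void add(Block block) {
        blocks.add(block);
    }

    /** Read-only view of the blocks added so far. */
    public List<Block> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }

    public int getWidth() {
        return width;
    }

    public int getHeight() {
        return height;
    }
}
```

With classes shaped like this, a caller could iterate over getBlocks() and use instanceof to separate text regions from image regions.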