Extracting text from a specific region of a PDF page using ICEpdf

Time: 2011-05-02 08:24:33

Tags: java pdf extraction text-extraction icepdf

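Is there a way to extract text from a specific region using ICEpdf? I can extract the whole page, but that is not what I want.

(I know PDFBox does a good job of extracting text from a specific rectangular region of a page. However, because image rendering is better in ICEpdf, I would like to use that library.)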

2 answers:

Answer 0: (score: 3)

You can get the text of the page you are interested in by calling:

PageText pageText = document.getPageText(pageNumber);

This is similar to the bundled example ./examples/extraction/PageTextExtraction.java

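The PageText object contains all of the page's LineText -> WordText -> GlyphText objects. LineText, WordText, and GlyphText all extend AbstractText, which has a getBounds() method. The bounds of these objects are in PDF user space, i.e. the first geometric quadrant (origin at the bottom-left, y increasing upward), whereas Java2D works in the fourth quadrant (origin at the top-left, y increasing downward). Assuming you already have a selectionRectangle, the code looks like this: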

// clear the currently selected state, ignore highlighted.
currentPage.getViewText().clearSelected();

// get page transform, same for all calculations
AffineTransform pageTransform = currentPage.getPageTransform(
        Page.BOUNDARY_CROPBOX,
        documentViewModel.getViewRotation(),
        documentViewModel.getViewZoom());

// convert the selection from view (Java2D) space into page space
Rectangle2D pageSpaceSelectRectangle =
        convertRectangleToPageSpace(selectionRectangle, pageTransform);
ArrayList<LineText> pageLines = pageText.getPageLines();
for (LineText pageLine : pageLines) {
    // check for intersection, if so break into words.
    if (pageLine.getBounds().intersects(pageSpaceSelectRectangle)) {
        // you have some selected text.
    }
}



    /**
     * Converts the rectangle to the space specified by the page transform. This
     * is a utility method for converting a selection rectangle to page space
     * so that an intersection can be calculated to determine a selected state.
     *
     * @param mouseRect     rectangle to convert space of
     * @param pageTransform page transform
     * @return converted rectangle.
     */
    private Rectangle2D convertRectangleToPageSpace(Rectangle mouseRect,
                                                    AffineTransform pageTransform) {
        GeneralPath shapePath;
        try {
            AffineTransform transform = pageTransform.createInverse();
            shapePath = new GeneralPath(mouseRect);
            shapePath.transform(transform);
            return shapePath.getBounds2D();
        } catch (NoninvertibleTransformException e) {
            logger.log(Level.SEVERE,
                    "Error converting mouse point to page space.", e);
        }
        return null;
    }
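
To make the coordinate conversion concrete, here is a minimal, self-contained sketch that uses only java.awt.geom and no ICEpdf classes. Everything in it is an assumption made for illustration: the Word class is a hypothetical stand-in for WordText, pageToView only approximates what a page transform does for an unrotated page (scale by the zoom and flip the y axis), and the page height and word positions are made up. It shows the same two steps as above: map the view-space rectangle back into page space with the inverse transform, then test each text bound for intersection.

```java
import java.awt.Rectangle;
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionTextSketch {

    /** Hypothetical stand-in for a word whose bounds are in PDF user space. */
    static final class Word {
        final String text;
        final Rectangle2D.Float bounds;

        Word(String text, float x, float y, float w, float h) {
            this.text = text;
            this.bounds = new Rectangle2D.Float(x, y, w, h);
        }
    }

    /**
     * Assumed page-to-view transform for an unrotated page: scale by the
     * zoom and flip the y axis, so the origin moves from the bottom-left
     * corner (PDF user space) to the top-left corner (Java2D).
     */
    static AffineTransform pageToView(float pageHeight, float zoom) {
        AffineTransform at = new AffineTransform();
        at.scale(zoom, -zoom);          // applied second: zoom and flip y
        at.translate(0, -pageHeight);   // applied first: move the top edge to y = 0
        return at;
    }

    /** Maps a view-space rectangle back into page space via the inverse transform. */
    static Rectangle2D viewToPage(Rectangle viewRect, AffineTransform pageToView) {
        try {
            return pageToView.createInverse()
                    .createTransformedShape(viewRect)
                    .getBounds2D();
        } catch (NoninvertibleTransformException e) {
            throw new IllegalArgumentException("Page transform is not invertible", e);
        }
    }

    /** Joins the text of every word whose bounds intersect the page-space region. */
    static String textInRegion(List<Word> words, Rectangle2D pageRegion) {
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            if (w.bounds.intersects(pageRegion)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // made-up page height in points
        List<Word> words = new ArrayList<>();
        words.add(new Word("Title", 72, 700, 60, 14));  // near the top of the page
        words.add(new Word("Footer", 72, 40, 60, 10));  // near the bottom of the page

        AffineTransform at = pageToView(pageHeight, 2.0f);
        // A rectangle in view pixels covering the top strip of the zoomed page.
        Rectangle selection = new Rectangle(0, 0, 1224, 400);
        Rectangle2D pageRegion = viewToPage(selection, at);

        System.out.println(textInRegion(words, pageRegion)); // prints "Title"
    }
}
```

Unlike the helper above, this sketch throws an exception instead of returning null when the transform cannot be inverted, so a failed conversion cannot surface later as a NullPointerException in the intersects() call. With ICEpdf you would presumably replace the Word list with the words of each intersecting LineText and use the transform returned by getPageTransform(...) rather than pageToView.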

Answer 1: (score: 2)

Have you already posted on the ICEpdf forums? They are usually very good at answering questions there.