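Is there a way to extract text from a specific region of a page using ICEpdf? I can extract the text of a whole page, but that is not what I want.
(I know PDFBox does a good job of extracting text from a specific rectangular region of a page. However, because image rendering is better in ICEpdf, I would like to use that library.)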
Answer 0 (score: 3)
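For the Page object that represents the page, you can get its text by calling:
    PageText pageText = document.getPageText(pageNumber);
This is similar to the example bundled with the package, ./examples/extraction/PageTextExtraction.java
The PageText object contains all of the page's LineText -> WordText -> GlyphText objects. LineText, WordText and GlyphText all extend AbstractText, which has a getBounds() method. The bounds of these objects are in PDF user space, i.e. the first geometric quadrant (origin at the bottom-left, y increasing upwards), whereas Java2D works in the fourth quadrant (origin at the top-left, y increasing downwards). So the selection rectangle has to be converted into page space before it is compared against the text bounds. Assuming you already have a selectionRectangle, the code would look something like this: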
    // clear the currently selected state, ignore highlighted.
    currentPage.getViewText().clearSelected();

    // get page transform, same for all calculations
    AffineTransform pageTransform = currentPage.getPageTransform(
            Page.BOUNDARY_CROPBOX,
            documentViewModel.getViewRotation(),
            documentViewModel.getViewZoom());

    // the helper returns null if the page transform cannot be inverted
    Rectangle2D pageSpaceSelectRectangle =
            convertRectangleToPageSpace(selectionRectangle, pageTransform);
    ArrayList<LineText> pageLines = pageText.getPageLines();
    for (LineText pageLine : pageLines) {
        // check for intersection, if so break into words.
        if (pageSpaceSelectRectangle != null
                && pageLine.getBounds().intersects(pageSpaceSelectRectangle)) {
            // you have some selected text.
        }
    }

    /**
     * Converts the rectangle to the space specified by the page transform. This
     * is a utility method for converting a selection rectangle to page space
     * so that an intersection can be calculated to determine a selected state.
     *
     * @param mouseRect     rectangle to convert space of
     * @param pageTransform page transform
     * @return converted rectangle, or null if the transform is not invertible.
     */
    private Rectangle2D convertRectangleToPageSpace(Rectangle mouseRect,
                                                    AffineTransform pageTransform) {
        GeneralPath shapePath;
        try {
            AffineTransform transform = pageTransform.createInverse();
            shapePath = new GeneralPath(mouseRect);
            shapePath.transform(transform);
            return shapePath.getBounds2D();
        } catch (NoninvertibleTransformException e) {
            logger.log(Level.SEVERE,
                    "Error converting mouse point to page space.", e);
        }
        return null;
    }
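To make the quadrant flip more concrete, here is a minimal sketch that uses only the JDK (java.awt.geom), with no ICEpdf classes at all. Everything in it is an assumption for illustration: the class RegionTextSketch, the methods pageToView, toPageSpace and selectWords, the made-up word boxes, and a page with no rotation whose origin is at (0, 0). In real code the transform would come from getPageTransform(...) and the bounds from the text objects' getBounds().

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; none of these names are ICEpdf API.
class RegionTextSketch {

    // Page space (origin bottom-left, y up) -> view space (origin top-left, y down).
    // Assumes no rotation and a page whose origin is at (0, 0).
    static AffineTransform pageToView(double pageHeight, double zoom) {
        AffineTransform t = new AffineTransform();
        t.scale(zoom, -zoom);         // zoom and flip the y axis
        t.translate(0, -pageHeight);  // applied to points first: page top goes to y = 0
        return t;
    }

    // Maps a view-space rectangle back into page space via the inverse transform.
    static Rectangle2D toPageSpace(Rectangle2D viewRect, AffineTransform pageToView) {
        try {
            return pageToView.createInverse()
                    .createTransformedShape(viewRect)
                    .getBounds2D();
        } catch (NoninvertibleTransformException e) {
            throw new IllegalArgumentException("Page transform is not invertible", e);
        }
    }

    // Returns the words whose page-space bounds intersect the view-space selection.
    static List<String> selectWords(Map<String, Rectangle2D> wordBounds,
                                    Rectangle2D viewRect,
                                    AffineTransform pageToView) {
        Rectangle2D pageRect = toPageSpace(viewRect, pageToView);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Rectangle2D> e : wordBounds.entrySet()) {
            if (e.getValue().intersects(pageRect)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up word boxes in page space on a 612 x 792 pt page.
        Map<String, Rectangle2D> words = new LinkedHashMap<>();
        words.put("Hello", new Rectangle2D.Double(60, 720, 30, 10)); // near the top
        words.put("World", new Rectangle2D.Double(60, 100, 30, 10)); // near the bottom

        AffineTransform t = pageToView(792, 2.0);
        // Selection dragged near the top of the view, in view (pixel) coordinates.
        Rectangle2D selection = new Rectangle2D.Double(100, 100, 200, 50);
        System.out.println(selectWords(words, selection, t));
    }
}
```

The point of the sketch is that the comparison happens in page space: the selection is pushed through the inverse of the page transform once, and the text bounds are left untouched. Doing it the other way round (transforming every text box into view space) should also work, but it costs one transform per box.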
Answer 1 (score: 2)
Have you posted this on the ICEpdf forum? They are usually quite good at answering questions there.