Question

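I am extracting text from a PDF using iTextSharp and the reader.GetPageContent method. I need to find the rectangle/position of each word found in the document. Is there a way to get the rectangle/position of a word in a PDF using iTextSharp?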

Answer 1

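Yes. Check out the text.pdf.parser package, and LocationTextExtractionStrategy in particular. Actually, that might not do it either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor: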

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

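public class MyTexExStrat implements TextExtractionStrategy {
    // Interface methods must be declared public in the implementing class.
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
      // track text and location here.
    }
    // TextExtractionStrategy also requires this method.
    public String getResultantText() { return ""; }
}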

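You may want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even be able to modify LTES to store parallel arrays of strings and rects.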
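To give a rough idea of the grouping step, here is a minimal sketch in plain Java with no iText dependency. Everything in it (WordGrouper, Chunk, groupIntoWords, maxGap, the 0.5 baseline tolerance) is a hypothetical name or value made up for illustration, not something from iText or from LTES. It assumes the chunks arrive in reading order and that each chunk is either pure whitespace or contains no whitespace at all, which is roughly what you would have if you collected text character by character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; none of these are iText classes.
final class WordGrouper {

    /** One piece of text with its box: lower-left (llx, lly), upper-right (urx, ury). */
    static final class Chunk {
        final String text;
        final float llx, lly, urx, ury;

        Chunk(String text, float llx, float lly, float urx, float ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /**
     * Merges chunks (assumed to be in reading order) into words.
     * A word ends at a whitespace chunk, when the bottom edge moves
     * (a new line), or when the horizontal gap exceeds maxGap.
     */
    static List<Chunk> groupIntoWords(List<Chunk> chunks, float maxGap) {
        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                // Whitespace closes the word in progress and is not stored.
                if (current != null) {
                    words.add(current);
                    current = null;
                }
                continue;
            }
            boolean sameLine = current != null
                    && Math.abs(c.lly - current.lly) < 0.5f;
            boolean adjacent = sameLine
                    && Math.abs(c.llx - current.urx) <= maxGap;
            if (adjacent) {
                // Extend the current word and grow its box to cover the new chunk.
                current = new Chunk(current.text + c.text,
                        current.llx, Math.min(current.lly, c.lly),
                        c.urx, Math.max(current.ury, c.ury));
            } else {
                if (current != null) {
                    words.add(current);
                }
                current = c;
            }
        }
        if (current != null) {
            words.add(current);
        }
        return words;
    }
}
```

In a real strategy you would build one Chunk per renderText call (or per character) from the info's descent and ascent lines, then run the grouping once the page has been processed. The right value for maxGap depends on font size, so treat it as something to tune.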

PS: To build the rects, you can get the AscentLine and DescentLine and use those coordinates as the top and bottom corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

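Warning: The code above assumes the text is horizontal and runs left to right. Rotated text will mess it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications the above should be fine, but know its limitations.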
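One way to soften that limitation, sketched here under my own assumptions rather than taken from iText, is to collect all four corners (the start and end points of both the descent line and the ascent line) and take the minimum and maximum of each coordinate. For rotated text this yields an enclosing axis-aligned box, not a tight rotated one. TextBox and boundingBox are hypothetical names, and the points are plain {x, y} float pairs so the snippet compiles without iText.

```java
// Hypothetical helper for illustration only; not part of iText.
final class TextBox {

    /**
     * Returns {llx, lly, urx, ury} for the smallest axis-aligned
     * rectangle that encloses all the given {x, y} points.
     */
    static float[] boundingBox(float[]... points) {
        if (points.length == 0) {
            throw new IllegalArgumentException("at least one point is required");
        }
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (float[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

The four returned values are in the same order as the Rectangle constructor arguments used above, so they can be passed straight through. Because the order of the input points does not matter, right-to-left text no longer produces an inverted rectangle.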

Good hunting.
