Question

Converting PDF points to pixels works fine: pixels = points × (1/72) × 300 (DPI).

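I get the position (X, Y) of each text block in the PDF, but Y is measured from bottom to top, not from top to bottom as in standard HTML or JavaScript.
Converting Y to a top-to-bottom value gives me an inaccurate Y position compared with HTML style or WinForms style.
How can I use the page height, the MediaBox rectangle, the CropBox, or a text margin finder rectangle to get the correct top-to-bottom Y?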

The code I am using is your example:

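using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

public class LocationTextExtractionStrategyClass : LocationTextExtractionStrategy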
{
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();
    /*
    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
    {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }
    */
    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = 0;// System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0)
        {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().ToList();//.Skip(startPosition).Take(this.TextToSearchFor.Length)
        var charsText = renderInfo.GetText();

        //Skip empty chunks, otherwise First()/Last() below would throw
        if (chars.Count == 0)
        {
            return;
        }

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar  = chars.Last();

        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight   = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        BaseColor curColor = new BaseColor(0f, 0f, 0f);
        if (renderInfo.GetFillColor() != null)
            curColor = renderInfo.GetFillColor();

        //Add this to our main collection
        myPoints.Add(new RectAndText(rect, charsText, curColor));//this.TextToSearchFor));
    }
}//end-of-txtLocation-class//

Answer 1

You are asking many different questions in one post.

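Let's first start with the coordinate system in the PDF standard. Note that I'm talking about a standard, more specifically about ISO 32000. The coordinate system on a PDF page is explained in my answer to the Stack Overflow question How should I interpret the coordinates of a rectangle in PDF?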

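As you can see, a rectangle drawn in PDF using the coordinates of the lower-left corner (llx, lly) and the upper-right corner (urx, ury) assumes that the X axis points to the right and the Y axis points upwards.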

As for the width and the height of the page, I explained this in my answer to the Stack Overflow question How to Get PDF page width and Height?

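For instance: you could have a /MediaBox defined as [0 0 595 842], which measures 595 x 842 points (an A4 page), together with a /CropBox defined as [5 5 590 837], which means that the visible area is only 585 x 832 points.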
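The arithmetic behind those numbers can be sketched as follows. This is only an illustration: `PdfBox` is a hypothetical helper type, not an iText class, and it simply stores the four numbers of a box array.

```csharp
// Hypothetical helper (not an iText type): a PDF box array [llx lly urx ury] in user units.
public readonly struct PdfBox
{
    public float Llx { get; }
    public float Lly { get; }
    public float Urx { get; }
    public float Ury { get; }

    public PdfBox(float llx, float lly, float urx, float ury)
    {
        Llx = llx;
        Lly = lly;
        Urx = urx;
        Ury = ury;
    }

    // Width and height follow from the two corners, not from the upper-right corner alone.
    public float Width => Urx - Llx;
    public float Height => Ury - Lly;
}
```

With the boxes from the example, `new PdfBox(0, 0, 595, 842)` measures 595 x 842, while `new PdfBox(5, 5, 590, 837)` measures 585 x 832.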

You also shouldn't assume that the lower-left corner of a page coincides with the (0, 0) coordinate. See Where is the Origin (x,y) of a PDF page?
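Putting the points so far together, a top-to-bottom Y can be derived by measuring from the top edge of the visible box instead of from 0. The sketch below is a minimal illustration under stated assumptions: `PdfCoordinates` and `ToTopLeft` are hypothetical names, the page is assumed to be unrotated, and the box values are passed in as plain numbers. In iTextSharp 5 they would typically come from something like `reader.GetCropBox(pageNumber)` (its `Left` and `Top` properties), but that wiring is not shown here.

```csharp
public static class PdfCoordinates
{
    // Converts a rectangle in PDF user space (origin-independent, Y grows upward)
    // into coordinates relative to the top-left corner of a page box (Y grows downward).
    // Assumes an unrotated page; boxLeft and boxTop are the box's llx and ury.
    public static (float X, float Y, float Width, float Height) ToTopLeft(
        float llx, float lly, float urx, float ury,
        float boxLeft, float boxTop)
    {
        float x = llx - boxLeft;   // distance from the left edge of the box
        float y = boxTop - ury;    // distance from the top edge of the box to the top of the rectangle
        return (x, y, urx - llx, ury - lly);
    }
}
```

For example, with a /CropBox of [5 5 590 837], a text rectangle with corners (105, 737) and (205, 749) maps to X = 100, Y = 88, width 100 and height 12, all still in points. The result can then be scaled to pixels for the target DPI.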

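When you create a document from scratch, a default margin of half an inch is used if you don't define the margins yourself. If you want to change the default, see Fit content on pdf size with iTextSharp?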

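Now for the height of a Chunk, or, if you are using iText 7 (which you should, but for some reason I don't know, don't), the height of a Text object: this depends on the font size. The font size is an average size of the different glyphs in a font. If you look at the letter g and compare it with the letter h, you see that g takes more space under the baseline of the text than h, whereas h takes more space above the baseline than g.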

If you want to calculate the exact space, read my answer to the question How to calculate the height of an element?

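If a text snippet is used in the context of lines in a paragraph, you also have to take the leading into account: Changing text line spacing (maybe that's not relevant in the context of your question, but it's good to know).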

If you have Chunk objects in iText 5 and you want to do specific things with these Chunks, you might benefit from using page events. See How to draw a line every 25 words?

If you want to add a colored background to a Chunk, it's even easier: How to set the paragraph of itext pdf file as rectangle with background color in Java

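Update 1: All of the above may be irrelevant if you want to convert HTML to PDF. In that case, it's easy: use iText 7 + pdfHTML as described in Converting HTML to PDF using iText, and all the math is done by the pdfHTML add-on.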

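Update 2: There seems to be some confusion regarding the measurement units. The differences between user units, points and pixels are explained in the FAQ page How do the measurement systems in HTML relate to the measurement system in PDF?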

In summary:

1 in. = 25.4 mm = 72 user units by default (but it can be changed).
1 in. = 25.4 mm = 72 pt.
1 in. = 25.4 mm = 96 px.
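As a rough sketch of how the summary translates into a conversion, assuming the default user unit of 1/72 inch (`PdfUnits` and `PointsToPixels` are hypothetical names, not part of iText):

```csharp
public static class PdfUnits
{
    // Points per inch; also the number of user units per inch when the default user unit applies.
    public const float PointsPerInch = 72f;

    // Scales a length in points to pixels at the given resolution (dots per inch).
    public static float PointsToPixels(float points, float dpi)
    {
        return points * dpi / PointsPerInch;
    }
}
```

At 300 DPI, as in the question, 72 points become 300 pixels and 36 points become 150 pixels. At the 96 px per inch used in HTML, 72 points become 96 pixels.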
