Question

Question: Given a PDF file, can I (easily) check for overlapping text using PDFsharp (or another .NET-compatible PDF library)?

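A solution that checks for overlapping letters (from two different text blocks) is preferred, but a solution that only checks for overlapping bounding boxes would also be acceptable.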

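What I have tried: An obvious solution would be to extract all text components together with their bounding boxes and check for overlaps. However, I have not found a way to extract text components with their bounding boxes in PDFsharp. To avoid the XY problem, I am asking the general question rather than how to extract text with PDFsharp.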

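Background: I am writing unit tests for reporting components. The reports are generated as PDF files, using both the PDF rendering component of RDLC reports and direct PDF output with PdfSharp.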

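In my unit tests, I would like to test these reports with different data sets and languages and find out whether any text overlaps. Currently, the unit test just exports a PDF for every combination I want to test, and someone has to go through them manually. I would like to automate that.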

Answer 1

The code below shows how to implement this detection using the XFINIUM.PDF library (since you asked for solutions that include other libraries):

public void TestCharacterOverlap()
{
    PdfFixedDocument document = new PdfFixedDocument("sample.pdf");

    for (int i = 0; i < document.Pages.Count; i++)
    {
        List<PdfVisualRectangle[]> overlaps = GetPageOverlaps(document.Pages[i]);
        if (overlaps.Count > 0)
        {
            // We have character overlapping.
        }
    }
}

public List<PdfVisualRectangle[]> GetPageOverlaps(PdfPage page)
{
    List<PdfVisualRectangle[]> overlaps = new List<PdfVisualRectangle[]>();

    PdfContentExtractor ce = new PdfContentExtractor(page);
    PdfTextFragmentCollection tfc = ce.ExtractTextFragments();

    for (int i = 0; i < tfc.Count; i++)
    {
        PdfTextGlyphCollection currentGlyphs = tfc[i].Glyphs;

        for (int j = 0; j < currentGlyphs.Count; j++)
        {
            // The current glyph's rectangle does not change in the inner loops, so compute it once.
            PdfVisualRectangle crtGlyphRect = GetGlyphRectangle(currentGlyphs[j].GlyphCorners);

            // Compare the current glyph with every glyph that follows it on the page.
            for (int k = i; k < tfc.Count; k++)
            {
                PdfTextGlyphCollection nextGlyphs = tfc[k].Glyphs;
                // In the same fragment, start at j + 1 to skip the current glyph and pairs already checked.
                // In later fragments, start at 0 so that no glyph is skipped.
                int start = (k == i) ? j + 1 : 0;
                for (int l = start; l < nextGlyphs.Count; l++)
                {
                    PdfVisualRectangle nextGlyphRect = GetGlyphRectangle(nextGlyphs[l].GlyphCorners);
                    if (Intersect(crtGlyphRect, nextGlyphRect))
                    {
                        // Keep both rectangles so the pair can be processed further by the caller.
                        PdfVisualRectangle[] overlap = new PdfVisualRectangle[] { crtGlyphRect, nextGlyphRect };
                        overlaps.Add(overlap);
                    }
                }
            }
        }
    }

    return overlaps;
}

public PdfVisualRectangle GetGlyphRectangle(PdfPoint[] glyphCorners)
{
    double minX = Math.Min(Math.Min(glyphCorners[0].X, glyphCorners[1].X), Math.Min(glyphCorners[2].X, glyphCorners[3].X));
    double minY = Math.Min(Math.Min(glyphCorners[0].Y, glyphCorners[1].Y), Math.Min(glyphCorners[2].Y, glyphCorners[3].Y));
    double maxX = Math.Max(Math.Max(glyphCorners[0].X, glyphCorners[1].X), Math.Max(glyphCorners[2].X, glyphCorners[3].X));
    double maxY = Math.Max(Math.Max(glyphCorners[0].Y, glyphCorners[1].Y), Math.Max(glyphCorners[2].Y, glyphCorners[3].Y));

    return new PdfVisualRectangle(minX, minY, maxX - minX, maxY - minY);
}

public bool Intersect(PdfVisualRectangle rc1, PdfVisualRectangle rc2)
{
    bool intersect = (rc1.Left < rc2.Left + rc2.Width) && (rc1.Left + rc1.Width > rc2.Left) &&
        (rc1.Top < rc2.Top + rc2.Height) && (rc1.Top + rc1.Height > rc2.Top);

    return intersect;
}

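A few notes on the code:
- In most situations (regular horizontal text) the glyph corners (4 points) form a rectangle. For diagonal text or skewed characters, however, the glyph corners form a general quadrilateral, so you will have to implement a more complex intersection procedure.
- The overlap test can be refined further to allow a small degree of overlap, for example by treating 2 characters as overlapping only if their intersection is greater than X% of the character area. That is why the GetPageOverlaps method returns a list of pairs of overlapping rectangles, so they can be processed further if needed.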
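The tolerance idea from the second note could be sketched roughly as follows. This is only an illustration under a few assumptions: `Box`, `OverlapTolerance`, and `minFraction` are hypothetical names that are not part of XFINIUM.PDF, and the fraction is measured against the smaller of the two rectangles, which is just one possible reading of "X% of the character area". The fields mirror the `Left`, `Top`, `Width`, and `Height` values that `Intersect` reads above, so a rectangle pair returned by `GetPageOverlaps` could be copied into two `Box` values.

```csharp
using System;

// Hypothetical helper type (not part of XFINIUM.PDF): a plain axis-aligned rectangle.
public struct Box
{
    public double Left, Top, Width, Height;

    public Box(double left, double top, double width, double height)
    {
        Left = left;
        Top = top;
        Width = width;
        Height = height;
    }

    public double Area
    {
        get { return Width * Height; }
    }
}

public static class OverlapTolerance
{
    // Area shared by two axis-aligned rectangles; 0 when they are disjoint or only touch.
    public static double IntersectionArea(Box a, Box b)
    {
        double w = Math.Min(a.Left + a.Width, b.Left + b.Width) - Math.Max(a.Left, b.Left);
        double h = Math.Min(a.Top + a.Height, b.Top + b.Height) - Math.Max(a.Top, b.Top);
        return (w > 0 && h > 0) ? w * h : 0.0;
    }

    // True when the shared area exceeds minFraction (for example 0.2 for 20%)
    // of the smaller rectangle. Assumption: the smaller glyph is the reference area.
    public static bool OverlapsSignificantly(Box a, Box b, double minFraction)
    {
        double smaller = Math.Min(a.Area, b.Area);
        if (smaller <= 0)
        {
            return false; // Degenerate rectangles cannot overlap meaningfully.
        }
        return IntersectionArea(a, b) / smaller > minFraction;
    }
}
```

A unit test could then filter the pairs from `GetPageOverlaps` through `OverlapsSignificantly` and fail only when a pair remains. A suitable threshold depends on the fonts and kerning in the reports, so it would have to be tuned against known-good PDFs.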

Disclaimer: I work for the company that develops the XFINIUM.PDF library.
