Question

I am using iTextSharp for PDF processing, and I need to extract all text from an existing PDF that is written in a certain font.

One way to do that is to inherit from RenderFilter and allow only text that has a certain PostscriptFontName. The problem is that when I do this, I see the following font names in the PDF:

CIDFont+F1
CIDFont+F2
CIDFont+F3
CIDFont+F4
CIDFont+F5

These are nothing like the actual font names I am looking for.

I have tried enumerating the font resources. It shows the same result.
I have tried opening the PDF in the full Adobe Acrobat. It also shows the mangled font names.
I have tried analyzing the file with iText RUPS. Same result.

That is, I cannot see the actual font names anywhere in the document structure.

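Yet Adobe Acrobat DC does show the correct font names (e.g. Arial, Courier New, Roboto) in the Format pane when I select various text boxes on the document canvas, so the information must be stored somewhere.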

How do I get the real font names when parsing a PDF with iTextSharp?

Answer 1

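As determined in the comments on the question, the font names are anonymized in all the PDF metadata for the fonts, but the embedded font programs themselves contain the actual font names.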

(So strictly speaking the PDF is broken, even though hardly any software will complain about it.)

Thus, if you want to retrieve those names, you have to look into the font programs.

Here is a proof of concept that follows the architecture used in the answer you referenced, i.e. it uses a RenderFilter:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Windows.Media;            // GlyphTypeface (WPF)
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class FontProgramRenderFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary fontDict = font.FontDictionary;
        PdfName subType = fontDict.GetAsName(PdfName.SUBTYPE);
        if (PdfName.TYPE0.Equals(subType))
        {
            // Composite font: the font descriptor belongs to the descendant CIDFont.
            PdfArray descendantFonts = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            // GetAsDict resolves the entry if it is stored as an indirect reference.
            PdfDictionary descendantFont = descendantFonts.GetAsDict(0);
            PdfDictionary fontDescriptor = descendantFont.GetAsDict(PdfName.FONTDESCRIPTOR);
            // FontFile2 holds an embedded TrueType font program.
            PdfStream fontStream = fontDescriptor.GetAsStream(PdfName.FONTFILE2);
            byte[] fontData = PdfReader.GetStreamBytes((PRStream)fontStream);
            MemoryStream dataStream = new MemoryStream(fontData);
            // GlyphTypeface needs a URI, so expose the bytes through an in-memory package part.
            MemoryPackage memoryPackage = new MemoryPackage();
            Uri uri = memoryPackage.CreatePart(dataStream);
            GlyphTypeface glyphTypeface = new GlyphTypeface(uri);
            memoryPackage.DeletePart(uri);
            // Accept the text chunk if any localized family name mentions "Arial".
            ICollection<string> names = glyphTypeface.FamilyNames.Values;
            return names.Any(name => name.Contains("Arial"));
        }
        else
        {
            // analogous code for other font subtypes
            return false;
        }
    }
}

The MemoryPackage class is from this answer, which was the first thing I found when searching for how to read information from an in-memory font using .NET.
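As an aside, if depending on WPF is not an option, the family names of a TrueType font program can also be read straight from its "name" table. The following is only a minimal sketch of that alternative technique, not what the filter above does: TrueTypeNameReader and ReadFamilyNames are hypothetical names, Macintosh-platform records are decoded as Latin-1 (merely an approximation of Mac Roman), and it assumes the embedded program still carries a "name" table, which an embedded subset is not guaranteed to do.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; not part of iTextSharp or WPF.
static class TrueTypeNameReader
{
    // TrueType stores all numbers big-endian (most significant byte first).
    static int U16(byte[] d, int p) { return (d[p] << 8) | d[p + 1]; }
    static long U32(byte[] d, int p) { return ((long)U16(d, p) << 16) | (long)U16(d, p + 2); }

    // Returns every family name (nameID 1) in the font's "name" table,
    // or an empty set if the table is missing or the data is truncated.
    public static ISet<string> ReadFamilyNames(byte[] font)
    {
        var result = new HashSet<string>();
        if (font == null || font.Length < 12) return result;

        int numTables = U16(font, 4);
        for (int i = 0; i < numTables; i++)
        {
            int record = 12 + 16 * i;                 // table directory entry
            if (record + 16 > font.Length) break;
            if (Encoding.ASCII.GetString(font, record, 4) != "name") continue;

            long table = U32(font, record + 8);       // offset of the name table
            if (table + 6 > font.Length) break;
            int t = (int)table;
            int count = U16(font, t + 2);             // number of name records
            int strings = t + U16(font, t + 4);       // start of string storage

            for (int n = 0; n < count; n++)
            {
                int entry = t + 6 + 12 * n;           // 12-byte name record
                if (entry + 12 > font.Length) break;
                int platform = U16(font, entry);
                int nameId = U16(font, entry + 6);
                int length = U16(font, entry + 8);
                int start = strings + U16(font, entry + 10);
                if (nameId != 1 || start + length > font.Length) continue;

                // Unicode (0) and Windows (3) records are UTF-16BE;
                // Macintosh (1) records are approximated as Latin-1 here.
                Encoding encoding = platform == 1
                    ? Encoding.GetEncoding("ISO-8859-1")
                    : Encoding.BigEndianUnicode;
                result.Add(encoding.GetString(font, start, length));
            }
            break;                                    // only one name table exists
        }
        return result;
    }
}
```

With the font bytes at hand, a check such as TrueTypeNameReader.ReadFamilyNames(fontData).Any(name => name.Contains("Arial")) (with System.Linq imported) would stand in for the GlyphTypeface lookup. This only covers TrueType data as found under FontFile2; other font program formats need their own parsing.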

Applying the filter to your PDF file like this:

using (PdfReader pdfReader = new PdfReader(SOURCE))
{
    FontProgramRenderFilter fontFilter = new FontProgramRenderFilter();
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), fontFilter);
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy));
}

the result is

This is Arial.

Beware: this is merely a proof of concept.

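On the one hand, you will certainly also need to implement the part marked with the comment "analogous code for other font subtypes" above. Even the TYPE0 part is not ready for production, as it only considers FONTFILE2 and does not handle null values gracefully.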
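To illustrate the null handling, here is a rough sketch of how the font program lookup might be made more defensive. FontProgramLocator and FindFontProgram are hypothetical names, and the sketch only locates the bytes. It reports which key the stream was found under, because FontFile holds Type 1 data, FontFile2 holds TrueType data, and FontFile3 holds a format named by the stream's own Subtype entry, so each needs a different way of reading the names.

```csharp
using iTextSharp.text.pdf;

// Hypothetical helper; a sketch, not production code.
static class FontProgramLocator
{
    static readonly PdfName[] FontFileKeys =
        { PdfName.FONTFILE, PdfName.FONTFILE2, PdfName.FONTFILE3 };

    // Returns the embedded font program bytes, or null if the font has none.
    // fileKey tells the caller which entry the stream was stored under.
    public static byte[] FindFontProgram(PdfDictionary fontDict, out PdfName fileKey)
    {
        fileKey = null;
        if (fontDict == null) return null;

        // For composite (Type0) fonts the descriptor sits on the descendant
        // CIDFont; for simple fonts it sits on the font dictionary itself.
        PdfDictionary holder = fontDict;
        if (PdfName.TYPE0.Equals(fontDict.GetAsName(PdfName.SUBTYPE)))
        {
            PdfArray descendants = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            holder = (descendants != null && descendants.Size > 0)
                ? descendants.GetAsDict(0)
                : null;
        }
        if (holder == null) return null;

        // Fonts without a descriptor have no embedded program to inspect.
        PdfDictionary descriptor = holder.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null) return null;

        foreach (PdfName key in FontFileKeys)
        {
            PRStream stream = descriptor.GetAsStream(key) as PRStream;
            if (stream == null) continue;
            fileKey = key;
            return PdfReader.GetStreamBytes(stream);
        }
        return null;   // font is not embedded
    }
}
```

In AllowText, a null result would then simply mean "no embedded font program to check", and the filter can fall back to returning false or to comparing PostscriptFontName.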

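On the other hand, you will want to cache the names of fonts you have already inspected, because AllowText is called for every text chunk and parsing the font program each time is wasteful.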
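Such a cache could look roughly like the sketch below. FontDecisionCache is a hypothetical name, and the choice of key is an assumption you have to check against your documents: the mangled PostscriptFontName (for example "CIDFont+F1") only works as a key if those names are unique within the PDF being processed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the accept/reject decision per font key
// so that the expensive font program inspection runs once per font.
sealed class FontDecisionCache<TKey>
{
    private readonly Dictionary<TKey, bool> decisions = new Dictionary<TKey, bool>();
    private readonly Func<TKey, bool> inspect;

    public FontDecisionCache(Func<TKey, bool> inspect)
    {
        if (inspect == null) throw new ArgumentNullException("inspect");
        this.inspect = inspect;
    }

    // Returns the cached decision, running the inspection only on first use.
    // The key must not be null, since Dictionary rejects null keys.
    public bool IsAllowed(TKey font)
    {
        bool allowed;
        if (!decisions.TryGetValue(font, out allowed))
        {
            allowed = inspect(font);
            decisions[font] = allowed;
        }
        return allowed;
    }
}
```

The filter would hold one such cache per document and call IsAllowed from AllowText, passing a delegate that does the actual font program inspection.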
