Question

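This behavior is expected, because the True Type text in these PDFs is effectively stored as images rather than as readable fonts. You would have to use image recognition to read it. The question comes up often, so I am posting my answer publicly.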

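Q: How do you parse a PDF when its fonts cannot be read for positioning purposes? Example: finding the account number on page 1, or the printed page number (for instance the page number in a duplex print run, as opposed to the document count).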

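I ran into this while managing statements. I needed to know which page I was on, where I was on that page, and what was on it. I came to realize that different printing software produces output files with different requirements, but you can usually find what you need in the annotations of the PDF output file you are reading. For example, I use a "Tray Call ID", which I found in the PDF I was reading with iTextSharp. Here is an example: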

Answer 1

I start with a simple method that tests the document's font type:

    // Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public void SetFontType()
    {
        this.PdfReaderContentParser = new PdfReaderContentParser(this.PdfReaderMain);

        // Try to extract the text of page 1. If nothing comes back, we know it is a TT font.
        ITextExtractionStrategy iTextExtractionStrategy = this.PdfReaderContentParser.ProcessContent(1, new SimpleTextExtractionStrategy());

        String pdfText = iTextExtractionStrategy.GetResultantText();

        this.TextType = String.IsNullOrEmpty(pdfText) ? TextType.TrueTypeFont : TextType.Default;
    }

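Once I have determined that the text is unreadable and that I am dealing with the True Type font case, I do the following to read the PDF [non-essential code omitted]: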

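The code below loops through the annotations looking for whatever special content you want to search for. In this case I am looking for the MT3 type key, or for a list of items that I use in the overlay. Every case is unique, but this sums up the basic idea of pulling the annotations out of a document. It is also covered briefly in the iText documentation.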

    // Requires: using System.Collections.Generic; using System.Linq; using iTextSharp.text.pdf;

    // Checks the page against every configured condition, then falls back to the default key.
    public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, PdfJob job)
    {
        List<string> keys = job.ConfigurationSettings.Where(cfs => cfs.Condition != null).Select(cs => cs.Condition).ToList();

        if (CycleAnnotations(reader, pageIndex, keys))
        {
            return true;
        }

        return CycleAnnotations(reader, pageIndex, "MT(TR3)"); // default key
    }

    // Returns true if any annotation on the page has contents matching the key.
    public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, string key)
    {
        PdfDictionary pdfDictionary = reader.GetPageN(pageIndex);
        PdfArray annots = pdfDictionary.GetAsArray(PdfName.ANNOTS);

        if (annots == null)
        {
            return false; // the page has no annotations
        }

        foreach (var iter in annots)
        {
            PdfDictionary annot = (PdfDictionary)PdfReader.GetPdfObject(iter);
            if (annot == null)
            {
                continue;
            }

            PdfString content = (PdfString)PdfReader.GetPdfObject(annot.Get(PdfName.CONTENTS));
            if (content != null && Utilities.IsAnnotationFound(content, key))
            {
                return true;
            }
        }

        return false;
    }

    // Returns true if any annotation on the page has contents matching any of the keys.
    public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, List<string> keys)
    {
        PdfDictionary pdfDictionary = reader.GetPageN(pageIndex);
        PdfArray annots = pdfDictionary.GetAsArray(PdfName.ANNOTS);

        if (annots == null)
        {
            return false; // the page has no annotations
        }

        foreach (string keyItem in keys)
        {
            foreach (var iter in annots)
            {
                PdfDictionary annot = (PdfDictionary)PdfReader.GetPdfObject(iter);
                if (annot == null)
                {
                    continue;
                }

                PdfString content = (PdfString)PdfReader.GetPdfObject(annot.Get(PdfName.CONTENTS));
                if (content != null && Utilities.IsAnnotationFound(content, keyItem))
                {
                    return true;
                }
            }
        }

        return false;
    }
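
The CycleAnnotations overloads call Utilities.IsAnnotationFound, which is not shown in this answer. As a rough sketch only, assuming the helper is nothing more than a case-insensitive substring match on the annotation's contents, it might look something like the following. The class name AnnotationMatcher and the string-based signature are my own assumptions for illustration, not the original code; with iTextSharp you would presumably pass content.ToUnicodeString() as the first argument.

```csharp
using System;

public static class AnnotationMatcher
{
    // Hypothetical stand-in for Utilities.IsAnnotationFound (not the original helper).
    // Returns true when the annotation text contains the key, ignoring case.
    public static bool IsAnnotationFound(string contents, string key)
    {
        if (string.IsNullOrEmpty(contents) || string.IsNullOrEmpty(key))
        {
            return false; // nothing to search, or nothing to search for
        }

        return contents.IndexOf(key, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

A plain substring test keeps keys such as "MT(TR3)" literal; a regular expression would treat the parentheses as grouping characters unless they were escaped.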

Hope this helps someone, and have a great day!

iTextSharp cannot see text in a PDF that uses True Type fonts
