Question

按照以下帖子： iTextSharp PDF Reading highlighed text (highlight annotations) using C#

此代码：

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots!=null)
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

正致力于提取PDF注释。但是为什么下面的代码不能用于突出显示（特别是PdfName.HIGHLIGHT不起作用）：

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots!=null)
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

Answer 1

请查看ISO-32000-1（又名PDF参考）中的表30。它被命名为＆＃34;页面对象中的条目＆＃34;。在这些条目中，您可以找到名为Annots的密钥。它的价值是：

（可选）应包含的注释字典数组间接引用与页面关联的所有注释（请参阅 12.5，＆＃34;注释＆＃34;）。

您将找不到包含Highlight等密钥的条目，因此当您拥有此行时，返回的数组为空是正常的：

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

您需要按照以前的方式获取注释：

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

现在，您需要遍历此数组并查找Subtype等于Highlight的注释。这种类型的注释列在ISO-32000-1的表169中，标题为＆＃34;注释类型＆＃34;。

换句话说，您假设页面词典包含带有键Highlight的条目是错误的，如果您阅读整个规范，您还会发现您已经做出的另一个错误假设。您错误地认为突出显示的文本存储在注释的Contents条目中。这表明对注释与页面内容的性质缺乏了解。

您要查找的文本存储在页面的内容流中。页面的内容流独立于页面的注释。因此，要获取突出显示的文本，您需要获取存储在Highlight注释中的坐标（存储在QuadPoints数组中），您需要使用这些坐标来解析存在于其中的文本。这些坐标的页面内容。

Answer 2

以下是使用itextSharp

提取突出显示文本的完整示例

    public void GetRectAnno()
    {

        string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;

        string filePath = appRootDir + "/PDFs/" + "anot.pdf";

        int pageFrom = 0;
        int pageTo = 0;

        try
        {
            using (PdfReader reader = new PdfReader(filePath))
            {
                pageTo = reader.NumberOfPages;

                for (int i = 1; i <= reader.NumberOfPages; i++)
                {


                    PdfDictionary page = reader.GetPageN(i);
                    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
                    if (annots != null)
                        foreach (PdfObject annot in annots.ArrayList)
                        {

                            //Get Annotation from PDF File
                            PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                            PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
                            //check only subtype is highlight
                            if (subType.Equals(PdfName.HIGHLIGHT))
                            {
                                 // Get Quadpoints and Rectangle of highlighted text
                                Console.Write("HighLight at Rectangle {0} with QuadPoints {1}\n", annotationDic.GetAsArray(PdfName.RECT), annotationDic.GetAsArray(PdfName.QUADPOINTS));

                                //Extract Text using rectangle strategy    
                                PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);

                                Rectangle rect = new Rectangle(float.Parse(coordinates.ArrayList[0].ToString(), CultureInfo.InvariantCulture.NumberFormat), float.Parse(coordinates.ArrayList[1].ToString(), CultureInfo.InvariantCulture.NumberFormat),
                                float.Parse(coordinates.ArrayList[2].ToString(), CultureInfo.InvariantCulture.NumberFormat),float.Parse(coordinates.ArrayList[3].ToString(), CultureInfo.InvariantCulture.NumberFormat));



                                RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                                ITextExtractionStrategy strategy;
                                StringBuilder sb = new StringBuilder();


                                strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));

                                //Show extract text on Console
                                Console.WriteLine(sb.ToString());
                                //Console.WriteLine("Page No" + i);

                            }



                        }



                }
            }
        }
        catch (Exception ex)
        {
        }
    }

如何使用iTextSharp从PDF中提取突出显示的文本？

2 个答案: