Question

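I am trying to extract the text content of a PDF file using the iText library with C#.

I created the PDF by exporting it from Excel 2013, then copied an example of how to use iText from the web (after adding the library reference to the project).

It reads the first page perfectly, but after that the output is garbled: it keeps part of the first page and merges that information with the next page. The commented-out lines are left over from my attempts to fix the problem, as is re-creating the string "thePage" inside the for loop.

Here is the code. I can email the PDF to anyone who can help with this problem.

Thanks in advance.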

    public static string ExtractTextFromPdf(string path)
    {

        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();

        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            //string[] theLines;
            //theLines = new string[COLUMNS];
            //string thePage;

            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                string thePage = "";
                thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);

                string [] theLines = thePage.Split('\n');
                foreach (var theLine in theLines)
                {
                    text.AppendLine(theLine);
                }
             //   text.AppendLine(" ");
            //    Array.Clear(theLines, 0, theLines.Length);
            //    thePage = "";
            }
            return text.ToString();
        }
    }

Answer 1

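The strategy object collects text data as it goes, but it has no way of knowing that a new page has started. Because the same instance is reused for every page, the text it already collected from earlier pages is returned again along with the new page's text.

So use a new strategy object for each page.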
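As a minimal sketch of that fix, assuming iTextSharp 5.x (the same `PdfReader`, `PdfTextExtractor` and `LocationTextExtractionStrategy` types used in the question), the strategy can be created inside the loop rather than before it. The wrapper class name `PdfTextHelper` is only an illustrative placeholder and does not come from the question:

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Hypothetical helper class, named here only for illustration.
    public static class PdfTextHelper
    {
        public static string ExtractTextFromPdf(string path)
        {
            StringBuilder text = new StringBuilder();

            using (PdfReader reader = new PdfReader(path))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    // A fresh strategy for every page, so no text collected
                    // from earlier pages is carried over into this one.
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();

                    string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
                    text.AppendLine(thePage);
                }
            }

            return text.ToString();
        }
    }

The only change that matters compared with the question's code is where the strategy is constructed. The split-and-reappend loop over the lines is dropped here because appending the page text directly should give essentially the same result.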

iText does not return the text content of a PDF after the first page
