Question

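I am looking for a way to use iTextSharp (v5.5.8) to select n characters of text on either side of a keyword match. I have got as far as using SimpleTextExtractionStrategy() to return a list of the pages where the search text was (supposedly) found. When I check by hand with a PDF viewer's search, sometimes the text is where iTextSharp says it is, and sometimes the viewer cannot find it on a page that iTextSharp reported. Sometimes it is not there at all.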

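The idea is to return 40 characters on either side of each keyword found, so that users can locate the reference more easily when they view the actual document. In another question I saw references to other text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum), and some kind of Contains(word)).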

Where can I find examples of how to use these functions? And how can I build a better strategy?

My current code:

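using System.Collections.Generic;
using System.IO;
using System.Web;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Returns the 1-based page numbers whose extracted text contains searchText.
public List<int> ReadPdfFile(string fileName, string searchText)
{
    // Path.Combine supplies any missing directory separator between the parts.
    string rootPath = HttpContext.Current.Server.MapPath("~");
    string dirPath = Path.Combine(rootPath, @"content\publications");
    string fullFile = Path.Combine(dirPath, fileName);

    List<int> pages = new List<int>();

    if (File.Exists(fullFile))
    {
        PdfReader pdfReader = new PdfReader(fullFile);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy for every page.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                // Contains is an exact, case-sensitive match.
                if (currentPageText.Contains(searchText))
                {
                    pages.Add(page);
                }
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfReader.Close();
        }
    }
    return pages;
}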
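
To make the "40 characters on either side" idea concrete, here is a minimal sketch that works on the page text once it has been extracted. It does not touch iTextSharp at all. The class name KeywordContext, the method name FindSnippets, and the radius parameter are made up for illustration, and the match is exact and case-sensitive, like Contains above.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordContext
{
    // Returns one snippet per occurrence of keyword in pageText, each with up to
    // `radius` characters of surrounding text on both sides (hypothetical helper).
    public static List<string> FindSnippets(string pageText, string keyword, int radius = 40)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(keyword))
        {
            return snippets;
        }

        int index = pageText.IndexOf(keyword, StringComparison.Ordinal);
        while (index >= 0)
        {
            // Clamp the window so it never runs past either end of the page text.
            int start = Math.Max(0, index - radius);
            int end = Math.Min(pageText.Length, index + keyword.Length + radius);
            snippets.Add(pageText.Substring(start, end - start));

            // Continue searching after the current match.
            index = pageText.IndexOf(keyword, index + keyword.Length, StringComparison.Ordinal);
        }
        return snippets;
    }
}
```

In the loop above, something like KeywordContext.FindSnippets(currentPageText, searchText) could be called for each page, and the snippets stored next to the page number instead of the page number alone.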

Sample output, written with simple Response.Write calls...

File1.pdf 1 3
File2.pdf 1 2 3 4

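The numbers after each file name are the pages on which the search keyword was found. In document 1, however, the keyword also appears at the very top of page 4, in the "References" section that starts on page 3. It should have been found twice in the references.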
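
One guess at why a page might be missed (an assumption, not something I have confirmed for these files): the extracted text may split the phrase across a line break, or use different capitalization than the search term, and an exact Contains would then fail. A minimal sketch of a more forgiving comparison, with the hypothetical names TextMatching, NormalizeWhitespace, and ContainsNormalized:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextMatching
{
    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    public static string NormalizeWhitespace(string text)
    {
        return Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
    }

    // Case-insensitive search that ignores differences in whitespace and line breaks.
    public static bool ContainsNormalized(string pageText, string searchText)
    {
        string haystack = NormalizeWhitespace(pageText);
        string needle = NormalizeWhitespace(searchText);
        if (needle.Length == 0)
        {
            return false;
        }
        return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

This would replace the currentPageText.Contains(searchText) check. It does not help if the extraction strategy returns the characters in a different order than they appear on the page.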

Thanks, Bob

P.S. Apparently 5.5.8 does not have an iTextSharp.text.pdf.parser.TextExtractionStrategy method...

Reading PDF content with iTextSharp and C#

0 answers: