Programmatically search a PDF file for text and report the page number?

Time: 2009-04-02 13:00:50

Tags: .net pdf text-search

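Some tools can extract the entire text of a PDF file so that the PDF can be full-text indexed.

What I need is a way to search for particular strings and, if a string is found in the PDF file, get back the number of the page it appears on.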

2 answers:

Answer 0: (score: 2)

This example uses the library that ships with Adobe Reader. It comes from http://www.dotnetspider.com/resources/5040-Get-PDF-Page-Number.aspx

using System.IO;
using System.Text;
using System.Windows.Forms;
using Acrobat;
using AFORMAUTLib;

private void pdfRandD(string fPath)
{
    // Open the document once just to read the page count.
    AcroPDDocClass objPages = new AcroPDDocClass();
    objPages.Open(fPath);
    long totalPdfPages = objPages.GetNumPages();
    objPages.Close();

    // Open the document in the viewer so that Acrobat JavaScript can run against it.
    AcroAVDocClass avDoc = new AcroAVDocClass();
    avDoc.Open(fPath, "Title");
    IAFormApp formApp = new AFormAppClass();
    IFields myFields = (IFields)formApp.Fields;

    string searchWord = "Search String";

    // The using block closes the report file even if an exception is thrown.
    using (StreamWriter sw = new StreamWriter(@"D:\KCG_FileChecker_Inputs\MAC\pdf\0230_525490_23_cha17.txt", false))
    {
        for (int p = 0; p < totalPdfPages; p++)
        {
            int numWords = int.Parse(myFields.ExecuteThisJavascript("event.value=this.getPageNumWords(" + p + ");"));

            // Rebuild the page text word by word, separated by single spaces.
            StringBuilder pageText = new StringBuilder();
            for (int i = 0; i < numWords; i++)
            {
                string chkWord = myFields.ExecuteThisJavascript("event.value=this.getPageNthWord(" + p + "," + i + ", true);");
                pageText.Append(' ').Append(chkWord);
            }

            if (pageText.ToString().Trim().Contains(searchWord))
            {
                // getPageLabel returns a label string that is not necessarily numeric
                // (for example "iv"), so it is kept as a string instead of being parsed.
                string pageLabel = myFields.ExecuteThisJavascript("event.value=this.getPageLabel(" + p + ",true);");
                sw.WriteLine("The word '" + searchWord + "' exists on page " + pageLabel);
            }
        }
    }
    MessageBox.Show("Process completed");
}

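Answer 1: (score: 2)

You can use the Docotic.Pdf library to search for text in PDF files.

The following sample shows how to find the specified strings in a PDF file and report the corresponding page numbers: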

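using System;
using BitMiracle.Docotic.Pdf;

static void searchForTextStrings()
{
    string path = "";                              // placeholder: path to the PDF file
    string[] stringsToFind = new string[] { };     // placeholder: strings to look for

    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            // Extract the text of one page and test every search string against it.
            string pageText = pdf.Pages[i].GetText();
            foreach (string s in stringsToFind)
            {
                int index = pageText.IndexOf(s, 0, StringComparison.CurrentCultureIgnoreCase);
                if (index != -1)
                    // Pages is indexed from 0, so add 1 to print a human-readable page number.
                    Console.WriteLine("'{0}' found on page {1}", s, i + 1);
            }
        }
    }
}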

If you remove the third argument of the IndexOf call (the StringComparison value), the search becomes case-sensitive.
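
The per-page matching can also be separated from the PDF library, which makes the case-sensitivity switch explicit. The following is only a minimal sketch: the class name PageSearch, the method FindPages, and its ignoreCase parameter are hypothetical names introduced for illustration, and the method assumes the text of each page has already been extracted into one string per page, in page order. It uses ordinal comparison rather than the culture-sensitive comparison in the sample above.

```csharp
using System;
using System.Collections.Generic;

public static class PageSearch
{
    // Hypothetical helper: returns the 1-based numbers of the pages whose
    // text contains `needle`. `pageTexts` is assumed to hold one non-null,
    // already-extracted string per page, in page order.
    public static List<int> FindPages(IReadOnlyList<string> pageTexts, string needle, bool ignoreCase)
    {
        var hits = new List<int>();

        // An empty needle would match every page, so treat it as "no match".
        if (string.IsNullOrEmpty(needle))
            return hits;

        // Ordinal comparison is an assumption of this sketch; use the
        // CurrentCulture variants if linguistic matching is wanted.
        StringComparison comparison = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (pageTexts[i].IndexOf(needle, comparison) >= 0)
                hits.Add(i + 1); // convert the 0-based index to a page number
        }
        return hits;
    }
}
```

For example, calling PageSearch.FindPages(new[] { "Intro", "Search String here", "search string again" }, "search string", true) yields pages 2 and 3, while passing false for ignoreCase yields only page 3.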

Disclaimer: I work for Bit Miracle, the vendor of the library.