Problem searching with Lucene.Net

Date: 2014-09-15 15:30:55

Tags: c# indexing lucene lucene.net

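I'm trying to build a search engine with Lucene.Net. I followed some documentation from the website, but I must have missed something, because it doesn't work as expected.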

Here is the code:

        var stringBuilder = new StringBuilder();
        var pdfReader = new PdfReader(@"c:\Test\testRoot.pdf");
        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page) + " ");
        }
        if (stringBuilder.ToString().Contains("new"))
        {
            Console.WriteLine("New is present in the text!");
        }
        const string strIndexDir = @"C:\Index";
        Directory indexDir = FSDirectory.Open(strIndexDir);
        Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
        var idwx = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
        var doc = new Document();            
        var fdl = new Field("testRoot", stringBuilder.ToString(), Field.Store.YES, Field.Index.ANALYZED);
        doc.Add(fdl);
        idwx.AddDocument(doc);
        idwx.Optimize();
        idwx.Dispose();
        Console.WriteLine("Indexing Done !");


        var parser = new QueryParser(Version.LUCENE_29, "new", std);
        var qry = parser.Parse(parser.Field);
        Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
        Searcher srch = new IndexSearcher(IndexReader.Open(directory, true));
        TopScoreDocCollector cllstr = TopScoreDocCollector.Create(100, true);
        ScoreDoc[] hits = cllstr.TopDocs().ScoreDocs;
        for (int i = 0; i < hits.Length; i++)
        {
            int docId = hits[i].Doc;
            float score = hits[i].Score;
            Document docy = srch.Doc(docId);
            Console.WriteLine(docy.Get("text"));
        }
        Console.ReadLine();

The thing is, the word "new" is definitely present in my PDF text, because execution enters the 'if' block.

But in the end, when I try to look for matches, nothing is found.

Edit:

I made some changes, but the result is still the same:

        var stringBuilder = new StringBuilder();
        var pdfReader = new PdfReader(@"c:\Test\testRoot.pdf");
        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page) + " ");
        }
        if (stringBuilder.ToString().Contains("new"))
        {
            Console.WriteLine("New is present in the text!");
        }
        const string strIndexDir = @"C:\Index";
        Directory indexDir = FSDirectory.Open(strIndexDir);
        Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
        var idwx = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
        var doc = new Document();            
        var fdl = new Field("testRoot", stringBuilder.ToString(), Field.Store.YES, Field.Index.ANALYZED);
        doc.Add(fdl);
        idwx.AddDocument(doc);
        idwx.Optimize();
        idwx.Commit();
        idwx.Dispose();

        Console.WriteLine("Indexing Done !");
        var parser = new QueryParser(Version.LUCENE_29, "", std);
        var qry = parser.Parse("new*");
        Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
        Searcher srch = new IndexSearcher(IndexReader.Open(directory, true));
        var lol = srch.Search(qry, 100);
        ScoreDoc[] hits = lol.ScoreDocs;
        for (int i = 0; i < hits.Length; i++)
        {
            int docId = hits[i].Doc;
            float score = hits[i].Score;
            Document docy = srch.Doc(docId);
            Console.WriteLine(docy.Get("testRoot"));
        }

Thanks for your help :).

1 Answer:

Answer 0 (score: 1)

Try:

var parser = new QueryParser(Version.LUCENE_29, "testRoot", std);

Or:

var qry = parser.Parse("testRoot:new*");

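You need to specify the correct field to search. testRoot appears to be the name of the field you want to search. The second argument to QueryParser specifies the default field to search. In the first example you provided, you passed "new", which does not appear to be the name of any field added to the document (in effect, your query in that case looks like: new:new). This default field is used for the search unless you specify the field in the query itself, e.g. myField:findThis (see query parser syntax).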
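
To make the difference concrete, here is a minimal sketch of both options, assuming Lucene.Net 3.0.3 (the version that matches the Dispose() calls in the question). The class and method names, the in-memory RAMDirectory, and the sample text are illustrative assumptions and do not come from the question; only the testRoot field name and the query strings do.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical demo class, not part of the original question.
public static class FieldSearchSketch
{
    // Builds a one-document index with a single analyzed field named "testRoot".
    // A RAMDirectory stands in for the question's FSDirectory (assumption).
    public static Directory BuildIndex(StandardAnalyzer analyzer)
    {
        var dir = new RAMDirectory();
        var writer = new IndexWriter(dir, analyzer, true,
                                     IndexWriter.MaxFieldLength.UNLIMITED);
        var doc = new Document();
        doc.Add(new Field("testRoot", "a brand new sample text",
                          Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Dispose(); // commits and closes the writer
        return dir;
    }

    // Parses queryText with the given default field and returns the hit count.
    public static int CountHits(Directory dir, StandardAnalyzer analyzer,
                                string defaultField, string queryText)
    {
        var parser = new QueryParser(Version.LUCENE_29, defaultField, analyzer);
        Query query = parser.Parse(queryText);
        using (var searcher = new IndexSearcher(dir, true))
        {
            return searcher.Search(query, 100).ScoreDocs.Length;
        }
    }

    public static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_29);
        Directory dir = BuildIndex(analyzer);

        // Option 1: make "testRoot" the parser's default field.
        Console.WriteLine(CountHits(dir, analyzer, "testRoot", "new*"));
        // Option 2: keep any default field, but name the field in the query.
        Console.WriteLine(CountHits(dir, analyzer, "", "testRoot:new*"));
        // The edited question's setup: default field "" and no field prefix,
        // so the query targets a field that was never indexed.
        Console.WriteLine(CountHits(dir, analyzer, "", "new*"));
    }
}
```

With this setup, the first two calls should find the document and the third should not, because the third query is run against the empty field name. Separately, the first snippet in the question creates a TopScoreDocCollector but never passes it to a search call such as srch.Search(qry, cllstr), so its hits array would be empty whichever field is queried.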