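I'm trying to use the Lucene.Net search engine. I followed some documentation on the website, but I must have missed something, because it doesn't work as expected.
Here is the code: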
var stringBuilder = new StringBuilder();
var pdfReader = new PdfReader(@"c:\Test\testRoot.pdf");
for (var page = 1; page <= pdfReader.NumberOfPages; page++)
{
stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page) + " ");
}
if (stringBuilder.ToString().Contains("new"))
{
Console.WriteLine("New is present in the text!");
}
const string strIndexDir = @"C:\Index";
Directory indexDir = FSDirectory.Open(strIndexDir);
Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
var idwx = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
var doc = new Document();
var fdl = new Field("testRoot", stringBuilder.ToString(), Field.Store.YES, Field.Index.ANALYZED);
doc.Add(fdl);
idwx.AddDocument(doc);
idwx.Optimize();
idwx.Dispose();
Console.WriteLine("Indexing Done !");
var parser = new QueryParser(Version.LUCENE_29, "new", std);
var qry = parser.Parse(parser.Field);
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
Searcher srch = new IndexSearcher(IndexReader.Open(directory, true));
TopScoreDocCollector cllstr = TopScoreDocCollector.Create(100, true);
ScoreDoc[] hits = cllstr.TopDocs().ScoreDocs;
for (int i = 0; i < hits.Length; i++)
{
int docId = hits[i].Doc;
float score = hits[i].Score;
Document docy = srch.Doc(docId);
Console.WriteLine(docy.Get("text"));
}
Console.ReadLine();
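The problem is that the word "new" is definitely present in my PDF text, since execution enters the 'if' block.
But in the end, when I search for matches, nothing comes back...
EDIT:
I made some changes, but the result is still the same: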
var stringBuilder = new StringBuilder();
var pdfReader = new PdfReader(@"c:\Test\testRoot.pdf");
for (var page = 1; page <= pdfReader.NumberOfPages; page++)
{
stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page) + " ");
}
if (stringBuilder.ToString().Contains("new"))
{
Console.WriteLine("New is present in the text!");
}
const string strIndexDir = @"C:\Index";
Directory indexDir = FSDirectory.Open(strIndexDir);
Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
var idwx = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
var doc = new Document();
var fdl = new Field("testRoot", stringBuilder.ToString(), Field.Store.YES, Field.Index.ANALYZED);
doc.Add(fdl);
idwx.AddDocument(doc);
idwx.Optimize();
idwx.Commit();
idwx.Dispose();
Console.WriteLine("Indexing Done !");
var parser = new QueryParser(Version.LUCENE_29, "", std);
var qry = parser.Parse("new*");
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
Searcher srch = new IndexSearcher(IndexReader.Open(directory, true));
var lol = srch.Search(qry, 100);
ScoreDoc[] hits = lol.ScoreDocs;
for (int i = 0; i < hits.Length; i++)
{
int docId = hits[i].Doc;
float score = hits[i].Score;
Document docy = srch.Doc(docId);
Console.WriteLine(docy.Get("testRoot"));
}
Thanks for your help :).
Answer 0 (score: 1)
Try:
var parser = new QueryParser(Version.LUCENE_29, "testRoot", std);
Or:
var qry = parser.Parse("testRoot:new*");
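You need to specify the correct field to search. testRoot appears to be the name of the field you are looking for. The second argument of QueryParser specifies the default field to search. In the first example you provided, you set it to "new", which does not appear to be the name of a field added to the document (in effect, your query ends up looking like new:new). In your edited version the default field is an empty string, so new* is searched in a field named "", which does not exist either. The default field is used for the search unless you specify the field inside the query itself, e.g. myField:findThis (see query parser syntax).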
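To make the default-field point concrete, here is a minimal sketch rather than a definitive fix. It assumes Lucene.Net 3.0.3 (the question uses Dispose() alongside Version.LUCENE_29, which suggests that release), and it swaps the PDF extraction and the on-disk index for a hypothetical sample string and an in-memory RAMDirectory, so that only the field handling is on display.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class DefaultFieldDemo
{
    static void Main()
    {
        // Hypothetical sample text standing in for the text extracted from the PDF.
        const string text = "a new document about newer things";

        // Assumption: an in-memory index, used here for illustration only.
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // Index a single document with one analyzed, stored field named "testRoot".
        var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        var doc = new Document();
        doc.Add(new Field("testRoot", text, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Dispose(); // closing the writer also commits pending changes

        var searcher = new IndexSearcher(IndexReader.Open(dir, true));

        // The second argument is the default field. It must match the indexed
        // field name, otherwise "new*" is looked up in a field that does not exist.
        var parser = new QueryParser(Version.LUCENE_29, "testRoot", analyzer);
        Query query = parser.Parse("new*");

        // Run the query and print the stored field of each hit.
        TopDocs results = searcher.Search(query, 100);
        foreach (ScoreDoc hit in results.ScoreDocs)
        {
            Console.WriteLine(searcher.Doc(hit.Doc).Get("testRoot"));
        }

        searcher.Dispose();
    }
}
```

Keeping the default field as it was and parsing "testRoot:new*" instead should behave the same way, since a field named explicitly in the query takes precedence over the parser's default.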