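Basically, I indexed 85k HTML files (Google results pages, where the search keywords are different university names), and for each page I store its title in the Lucene index as a field named "title". When I search with keywords such as "duquesne AND university", no results come back. However, when I change the keyword to "duquesne", I get the result "title: Duquesne Univeristy - Google Search". Why is that? From the second attempt I can tell that the file titled Duquesne Univeristy was indexed, yet I cannot retrieve it with the first query. Many thanks!
Below is the code that builds the index; I use Jsoup to get the title from each web page: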
//indexDir is the directory that hosts Lucene's index files
File indexDir = new File("F:\\luceneIndex");
Directory myindex=SimpleFSDirectory.open(indexDir);
//dataDir is the directory that hosts the text files to be indexed
File dataDir = new File("I:\\luceneTextFiles");
Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
File[] dataFiles = dataDir.listFiles();
IndexWriterConfig indexConfig=new IndexWriterConfig(Version.LUCENE_CURRENT,luceneAnalyzer);
IndexWriter indexWriter = new IndexWriter(myindex, indexConfig);
long startTime = new Date().getTime();
System.out.println("Total file number is "+dataFiles.length+"");
for(int i = 0; i < dataFiles.length; i++){
if(dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")){
org.jsoup.nodes.Document t=Jsoup.parse(dataFiles[i], "UTF-8");
Document document = new Document();
Reader txtReader = new FileReader(dataFiles[i]);
document.add(new Field("title",t.title(),Field.Store.YES,Field.Index.ANALYZED));
document.add(new Field("path",dataFiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED));
document.add(new Field("count",i+"",Field.Store.YES,Field.Index.NOT_ANALYZED));
document.add(new Field("contents",txtReader));
indexWriter.addDocument(document);
}
}
//indexWriter.getCommitData();
indexWriter.close();
long endTime = new Date().getTime();
The search code:
String queryKey="duquesne";
String subqueryKey="university";
String queryField="contents";
String subqueryField="title";
/*
* 0------>normal search
* 1------>range search
* 2------>prefix search
* 3------>combine search
* 4------>phrase query
* 5------>wild card query
* 6------>fuzzy query
*/
int querychoice=0;
//initialize the directory
File indexDir=new File("F:\\luceneIndex");
Directory directory=SimpleFSDirectory.open(indexDir);
IndexReader reader=IndexReader.open(directory);
//initialize the searcher
IndexSearcher searcher=new IndexSearcher(reader);
Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_CURRENT);
Query query;
switch(querychoice){
case 0:
QueryParser parser=new QueryParser(Version.LUCENE_CURRENT,subqueryField,analyzer);
query=parser.parse(queryKey);
break;
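Answer 0 (score: 1)
Hmm, maybe because the search keyword "university" and the indexed word "Univeristy" are not the same word? Or did you just misspell it in your question?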
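To illustrate the point about the spelling, here is a minimal plain-Java sketch of term-level AND matching. It does not use Lucene: the `tokenize` and `matchesAll` helpers are hypothetical stand-ins that only approximate what an analyzer and a boolean AND query do (lowercase the text, split it into terms, require every query term to be present).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TitleMatchSketch {
    // Rough stand-in for an analyzer (an assumption, not Lucene's StandardAnalyzer):
    // lowercase the text, then split on anything that is not a letter or digit.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // AND semantics: every required term must appear among the title's tokens.
    static boolean matchesAll(String title, String... requiredTerms) {
        Set<String> tokens = tokenize(title);
        for (String term : requiredTerms) {
            if (!tokens.contains(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String title = "Duquesne Univeristy - Google Search";
        System.out.println(matchesAll(title, "duquesne"));               // true
        System.out.println(matchesAll(title, "duquesne", "university")); // false
        System.out.println(matchesAll(title, "duquesne", "univeristy")); // true
    }
}
```

If the title really is stored with the misspelling, the exact term "university" never exists in that document, so an AND query that requires it cannot match.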
Answer 1 (score: 1)
Parsing title:Duquesne Univeristy - Google Search with the standard analyzer results in the query title:duquesne defaultfield:univeristy defaultfield:google defaultfield:search, and the clauses are connected with OR.
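The field scoping described here can be sketched in plain Java. This is a toy model, not Lucene's QueryParser: the `assignFields` helper and the default field name "contents" are assumptions made for illustration. It only shows that a `field:` prefix binds to the single term that follows it, while the remaining terms fall back to the default field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FieldScopeSketch {
    // Toy model (an assumption, not Lucene's parser): split on whitespace,
    // attach an explicit "field:" prefix only to the term it is written on,
    // and give every other term the default field.
    static List<String> assignFields(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            String field = defaultField;
            String term = raw;
            int colon = raw.indexOf(':');
            if (colon > 0) {
                field = raw.substring(0, colon);
                term = raw.substring(colon + 1);
            }
            // Crude normalization: lowercase and drop punctuation such as "-".
            term = term.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
            if (!term.isEmpty()) {
                clauses.add(field + ":" + term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(assignFields("title:Duquesne Univeristy - Google Search", "contents"));
        // [title:duquesne, contents:univeristy, contents:google, contents:search]
    }
}
```

In Lucene's classic query syntax, grouping the terms in parentheses after the field name, for example title:(duquesne AND university), keeps every term in the same field.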