Lucene: the document exists, but I cannot retrieve it using QueryParser

Time: 2014-01-14 08:30:12

Tags: lucene

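Basically, I indexed 85k HTML files (Google result pages, where the keywords are different university names), and in the Lucene index I store each page's title in a field named "title". When I search with keywords such as "duquesne AND university", no results come back. However, when I change the keyword to "duquesne", I get a result with the title "title: Duquesne Univeristy - Google Search". Why is that? From the second attempt I can tell that the document titled Duquesne Univeristy has been indexed, but I cannot retrieve it with the first query. Many thanks!

Below is the code that builds the index; I use Jsoup to get the title from each web page: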

//indexDir is the directory that hosts Lucene's index files 
     File   indexDir = new File("F:\\luceneIndex"); 

     Directory myindex=SimpleFSDirectory.open(indexDir);
     //dataDir is the directory that hosts the text files to be indexed 
     File   dataDir  = new File("I:\\luceneTextFiles"); 
     Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); 
     File[] dataFiles  = dataDir.listFiles(); 
     IndexWriterConfig indexConfig=new IndexWriterConfig(Version.LUCENE_CURRENT,luceneAnalyzer);
     IndexWriter indexWriter = new IndexWriter(myindex, indexConfig); 
     long startTime = new Date().getTime(); 
     System.out.println("Total file number is  "+dataFiles.length+"");
     for(int i = 0; i < dataFiles.length; i++){ 
          if(dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")){
               org.jsoup.nodes.Document t=Jsoup.parse(dataFiles[i], "UTF-8");                  
               Document document = new Document(); 
               Reader txtReader = new FileReader(dataFiles[i]); 
               document.add(new Field("title",t.title(),Field.Store.YES,Field.Index.ANALYZED));
               document.add(new Field("path",dataFiles[i].getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED)); 
               document.add(new Field("count",i+"",Field.Store.YES,Field.Index.NOT_ANALYZED));
               document.add(new Field("contents",txtReader)); 
               indexWriter.addDocument(document); 

          } 
     } 

     //indexWriter.getCommitData();
     indexWriter.close(); 
     long endTime = new Date().getTime(); 

String queryKey="duquesne";
        String subqueryKey="university";
        String queryField="contents";
        String subqueryField="title";
        /*
         * 0------>normal search
         * 1------>range search
         * 2------>prefix search
         * 3------>combine search
         * 4------>phrase query
         * 5------>wild card query
         * 6------>fuzzy query
         */
        int querychoice=0;

        //initialize the directory
        File indexDir=new File("F:\\luceneIndex");
        Directory directory=SimpleFSDirectory.open(indexDir);
        IndexReader reader=IndexReader.open(directory);
        //initialize the searcher
        IndexSearcher searcher=new IndexSearcher(reader);
        Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_CURRENT);
        Query query;
        switch(querychoice){

        case 0:
            QueryParser parser=new QueryParser(Version.LUCENE_CURRENT,subqueryField,analyzer);
            query=parser.parse(queryKey);
            break;

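2 Answers:

Answer 0: (score: 1)

Hmm, maybe it's because the search keyword university and the indexed word Univeristy are not the same word? Or did you just misspell it in your question?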
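
To make the mismatch concrete, here is a minimal sketch in plain Java with no Lucene involved. It assumes the analyzer does little more than lowercase the text and split it on non-alphanumeric characters, which only approximates what StandardAnalyzer does. The class and method names (TitleMatchSketch, tokenize, matchesAll) are made up for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TitleMatchSketch {

    // Rough stand-in for an analyzer (an assumption, not StandardAnalyzer itself):
    // lowercase the text and split on anything that is not a letter or digit.
    static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // AND semantics: every query term must appear among the indexed terms.
    static boolean matchesAll(String title, String... queryTerms) {
        Set<String> indexed = tokenize(title);
        for (String q : queryTerms) {
            if (!indexed.contains(q.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String title = "Duquesne Univeristy - Google Search";
        System.out.println(matchesAll(title, "duquesne"));               // true
        System.out.println(matchesAll(title, "duquesne", "university")); // false
        System.out.println(matchesAll(title, "duquesne", "univeristy")); // true
    }
}
```

Under these assumptions the indexed term is univeristy, so an AND query that requires university cannot match, while the single-term query duquesne does.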

Answer 1: (score: 1)

Parsing title:Duquesne Univeristy - Google Search with the standard analyzer results in the query title:duquesne defaultfield:univeristy defaultfield:google defaultfield:search, with the clauses joined by OR.
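
The field-scoping behaviour described here can be sketched with a toy model in plain Java. It is not Lucene's parser: it ignores operators, grouping, quoting and the "-" character, and the names FieldScopeSketch and assignFields are hypothetical. It only illustrates that a field: prefix attaches to the single term that follows it, while every other bare term falls back to the default field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FieldScopeSketch {

    // Toy model (an assumption, not Lucene's real grammar):
    // "field:term" binds the field to that one term only;
    // bare terms are assigned to the default field.
    static List<String> assignFields(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String field = defaultField;
            String term = token;
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                field = token.substring(0, colon);
                term = token.substring(colon + 1);
            }
            clauses.add(field + ":" + term.toLowerCase(Locale.ROOT));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // The dash from the original title is left out because this toy model does not handle it.
        System.out.println(assignFields("title:Duquesne Univeristy Google Search", "contents"));
        // [title:duquesne, contents:univeristy, contents:google, contents:search]
    }
}
```

In Lucene's classic query syntax, parentheses after the field name, as in title:(duquesne univeristy), apply the field to every term in the group, which is one way to keep all terms on the title field.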