Question

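I am using Lucene 4.6 with a PhraseQuery to search for words in a PDF. My code is below. I can get the output text from the PDF, and the query is built as contents:"Following are the". But the number of hits is always 0. Any suggestions? Thanks in advance.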

            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);

            // Store the index in memory:               
            Directory directory = new RAMDirectory();
            // To store an index on disk, use this instead:
            //Directory directory = FSDirectory.open("/tmp/testindex");
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
            IndexWriter iwriter = new IndexWriter(directory, config);
            iwriter.deleteAll();
            iwriter.commit();
            Document doc = new Document();
            PDDocument document = null;
                try {
                    document = PDDocument.load(strFilepath);
                } 
                catch (IOException ex) {
                    System.out.println("Exception occurred while loading the document: " + ex);
                }
              String output=new PDFTextStripper().getText(document);
              System.out.println(output);
            //String text = "This is the text to be indexed";
            doc.add(new Field("contents", output, TextField.TYPE_STORED));
            iwriter.addDocument(doc);
            iwriter.close();

            // Now search the index
            DirectoryReader ireader = DirectoryReader.open(directory);
            IndexSearcher isearcher = new IndexSearcher(ireader);
            String sentence = "Following are the";
            //IndexSearcher searcher = new IndexSearcher(directory);
            if(output.contains(sentence)){
                System.out.println("");
            }

           PhraseQuery query = new PhraseQuery();
            String[] words = sentence.split(" ");
            for (String word : words) {
               query.add(new Term("contents", word));
            }

            ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
            // Iterate through the results:
            if(hits.length>0){
                System.out.println("Searched text existed in the PDF.");
            }
            ireader.close();
            directory.close();
         }
         catch(Exception e){
             System.out.println("Exception: "+e.getMessage());
         }

Answer 1

There are two reasons why the PhraseQuery is not working:

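StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream at index time. That means when you search for "Following are the", the terms are and the are not in the index, so the PhraseQuery can never match: the terms it looks for were never there in the first place. The solution is to build the analyzer with this constructor when indexing: Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET); this ensures that no words are removed from the TokenStream while indexing.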
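StandardAnalyzer also uses LowerCaseFilter, which means all tokens are normalized to lowercase. So Following is indexed as following, and searching for the term "Following" returns nothing. To fix this, call .toLowerCase() on your sentence before building the query, and the search should return results.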
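
Both fixes can be combined as in the minimal sketch below. It assumes Lucene 4.6 (lucene-core and lucene-analyzers-common) is on the classpath. The class name PhraseSearchSketch, the helper countHits, and the sample text are hypothetical, and a plain string stands in for the text extracted from the PDF.

```java
import java.util.Locale;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PhraseSearchSketch {

    // Hypothetical helper: indexes `text` in memory and returns the number
    // of documents matching `sentence` as an exact phrase.
    static int countHits(String text, String sentence) throws Exception {
        // Fix 1: an empty stop set, so "are" and "the" stay in the index.
        Analyzer analyzer =
                new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);

        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory, new IndexWriterConfig(Version.LUCENE_46, analyzer));
        Document doc = new Document();
        doc.add(new TextField("contents", text, Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Fix 2: lowercase the query terms, because the analyzer lowercased
        // the indexed tokens. PhraseQuery terms are not analyzed.
        PhraseQuery query = new PhraseQuery();
        for (String word : sentence.toLowerCase(Locale.ROOT).split(" ")) {
            query.add(new Term("contents", word));
        }

        DirectoryReader reader = DirectoryReader.open(directory);
        try {
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
            directory.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample text is an assumption standing in for the PDF contents.
        String text = "Following are the steps to install.";
        System.out.println(countHits(text, "Following are the"));
    }
}
```

Using Locale.ROOT for the lowercasing is a choice made in this sketch to avoid locale-specific case mappings; the answer itself only calls for .toLowerCase().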

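Also have a look at this link, which specifies Unicode Standard Annex #29, the standard that StandardTokenizer follows. From a brief look, it seems that APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters are ignored at index time under certain conditions.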
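
To see how these index-time effects leave the raw query terms without a match, here is a rough illustration that does not use Lucene at all. It is a crude stand-in, not Lucene's implementation: it splits on every character that is not a letter or digit (StandardTokenizer follows UAX #29 and keeps some punctuation inside tokens under certain conditions), it uses only a small assumed subset of the stop words, and it ignores term positions, which a real PhraseQuery also checks. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {

    // Assumption: a small subset of the English stop words, for illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "are", "as", "the", "to"));

    // Crude approximation of index-time analysis: split on punctuation and
    // whitespace, lowercase each token, drop stop words.
    static List<String> indexTerms(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !stopWords.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    // Query terms built as in the question: split on spaces, no analysis.
    // Returns the query terms that do not exist among the indexed terms.
    static List<String> missingTerms(String sentence, List<String> indexed) {
        List<String> missing = new ArrayList<String>();
        for (String word : sentence.split(" ")) {
            if (!indexed.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String text = "Following are the steps, as listed.";

        // Default-like behavior: stop words removed, raw query terms.
        List<String> indexed = indexTerms(text, STOP_WORDS);
        System.out.println("indexed: " + indexed);
        System.out.println("missing: " + missingTerms("Following are the steps,", indexed));

        // With an empty stop set and a lowercased, punctuation-free query.
        List<String> all = indexTerms(text, new HashSet<String>());
        System.out.println("indexed: " + all);
        System.out.println("missing: " + missingTerms("following are the steps", all));
    }
}
```

In the first case every raw query term is reported missing, either because of case, stop word removal, or the trailing comma; in the second case none is.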

PhraseQuery + Lucene 4.6 is not working for PDF word search
