Searching for a sentence in a PDF using a Lucene phrase query and PDFBox

Time: 2014-01-15 12:08:51

Tags: lucene

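I am using the following code to search for text in a PDF. It works fine for single words. But for a sentence such as the one in the code, it reports that the text does not exist even though the text is present in the document. Can anyone help me solve this?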

    try {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open("/tmp/testindex");
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        PDDocument document = null;
        try {
            document = PDDocument.load(strFilepath);
        } catch (IOException ex) {
            System.out.println("Exception occurred while loading the document: " + ex);
        }
        int i = 1;
        String name = null;
        // Extract the full text of the PDF and index it in the "contents" field
        String output = new PDFTextStripper().getText(document);
        //String text = "This is the text to be indexed";
        doc.add(new Field("contents", output, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();
        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);

        // Build the phrase query by hand from the raw words of the sentence
        String sentence = "Following are the";
        PhraseQuery query = new PhraseQuery();
        String[] words = sentence.split(" ");
        for (String word : words) {
            query.add(new Term("contents", word));
        }
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Searched text existed in the PDF.");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }
}

1 answer:

Answer 0: (score: 0)

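You should use the query parser to create the query from your sentence instead of building the phrase query yourself. The query you build by hand contains the term "Following", which is not in the index: the StandardAnalyzer lowercases text during indexing, so only "following" is indexed.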
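
To illustrate the mismatch, here is a minimal sketch in plain Java. It is a toy model, not Lucene: the `analyze` helper is a hypothetical stand-in for the analyzer (it only splits on whitespace and lowercases), and the sample document text is made up. It shows why terms taken straight from `sentence.split(" ")` miss, while terms passed through the same analysis as the indexed text match. In the code from the question, the corresponding change would presumably be to build the query with the existing parser, for example `parser.parse("\"" + sentence + "\"")`, so that the analyzer is applied to the phrase as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class AnalyzedPhraseDemo {

    // Hypothetical stand-in for an analyzer: split on whitespace and lowercase.
    // A real analyzer does more (punctuation handling, possibly stop words).
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if the phrase terms occur consecutively in the indexed tokens.
    // Terms are compared exactly; no normalization happens at this stage.
    static boolean containsPhrase(List<String> indexed, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    public static void main(String[] args) {
        // Made-up document text, analyzed at "index time".
        List<String> indexed = analyze("Following are the steps to install the tool");

        String sentence = "Following are the";

        // Raw terms, as in the question: "Following" keeps its capital F.
        List<String> rawTerms = Arrays.asList(sentence.split(" "));
        System.out.println(containsPhrase(indexed, rawTerms));          // prints "false"

        // Terms run through the same analysis as the indexed text.
        System.out.println(containsPhrase(indexed, analyze(sentence))); // prints "true"
    }
}
```

The point of the sketch is only that query terms must go through the same normalization as the indexed text; the actual tokenization rules of StandardAnalyzer are not reproduced here.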