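PhraseQuery not working in Apache Lucene 7.2.1

Time: 2018-01-30 06:27:38

Tags: java lucene

I am new to Apache Lucene and am using Apache Lucene v7.2.1. I need to run phrase searches over a huge file. To explore Lucene's phrase search, I first wrote a sample program using PhraseQuery, but it does not work. My code is below: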

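import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample 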
{

  private static final String INDEX_DIR = "myIndexDir";
  // function to create index writer
  private static IndexWriter createWriter() throws IOException
  {
    FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(dir, config);
    return writer;
  }
// function to create the index document.
  private static Document createDocument(Integer id, String source, String target)
  {
    Document document = new Document();
    document.add(new StringField("id", id.toString() , Store.YES));
    document.add(new TextField("source", source , Store.YES));
    document.add(new TextField("target", target , Store.YES));
    return document;
  }

  // function to do index search by source
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
  {        
      // phrase query build
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    String[] words = source.split(" ");
    int ii = 0;
    for (String word : words) {
        builder.add(new Term("source", word), ii);
        ii = ii + 1;
    }
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

  public static void main(String[] args) throws Exception 
  {
    // TODO Auto-generated method stub
    // create index writer
    IndexWriter writer = createWriter();
    //create documents object
    List<Document> documents = new ArrayList<>();

    String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
    String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
    Document d1 = createDocument(1, src, tgt);
    documents.add(d1);

    src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
    Document d2 = createDocument(2, src, tgt);
    documents.add(d2);

    writer.deleteAll();

    // adding documents to index writer
    writer.addDocuments(documents);
    writer.commit();
    writer.close();

    // for index searching

    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

    //Search by source
    TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
    System.out.println("Total Results count :: " + foundDocs.totalHits);
  }

}

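When I search for the string "benefit of an individual" as shown above, the total result count is 0, even though the phrase is present in document 1. Any help with this would be great. Thanks in advance.

1 Answer:

Answer 0: (score: 4)

Let's start with a summary:

  • At index time you are using the standard analyzer, which removes English stop words
  • At query time you are using your own analysis, which neither removes stop words nor strips special characters

Rule: use the same analysis chain at index time and at query time.

Here is a simplified, "correct" version of the query processing: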
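To see why the mismatch matters, here is a toy model in plain Java (no Lucene involved). The `STOP_WORDS` set and the `analyzedTerms` method are hypothetical stand-ins for what an analyzer with stop words roughly does (lowercase, strip punctuation, drop stop words); they are not Lucene's actual implementation. The sketch only shows that a plain `split(" ")` produces query terms that never made it into the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisMismatchDemo {

    // Hypothetical, deliberately short stop word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "are", "for", "of", "on", "or", "the", "to"));

    // Toy index-time analysis: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyzedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // What the question's code does at query time: a plain split on spaces.
    static List<String> naiveTerms(String text) {
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        List<String> indexed = analyzedTerms(
                "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group");
        System.out.println("indexed terms: " + indexed);
        // A phrase query needs every one of its terms to exist in the index.
        for (String term : naiveTerms("benefit of an individual")) {
            System.out.println(term + " -> in index? " + indexed.contains(term));
        }
    }
}
```

In this model the query terms "of" and "an" are absent from the indexed terms, so a phrase that requires all four terms cannot match.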

  // function to do index search by source
  private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
    // phrase query build
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    TokenStream tokenStream = new StandardAnalyzer().tokenStream("source", source);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
      builder.add(new Term("source", charTermAttribute.toString()));
    }
    tokenStream.end();
    tokenStream.close();
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

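To keep things simple, you can remove stop words from the standard analyzer by using the constructor that takes an empty stop word list, and everything will work as you expect. You can read more about stop words and phrase queries here

All the problems with phrase queries start with stop words. Under the hood, Lucene stores the position of every indexed term in a special index structure, the term positions, and those positions still count the stop words that were removed. This is useful in some cases, for example to tell "the goal" apart from "goal". A phrase query takes these term positions into account. For example, take the text "black and white" with the stop word "and". The Lucene index will then contain two terms: "black" at position 1 and "white" at position 3. A naive phrase query "black white" will not match anything, because it does not allow gaps between term positions. There are two possible strategies for building a correct query:

  • "black ? white" - use a special placeholder for each stop word. This matches both "black and white" and "black or white"
  • "black white"~1 - allow a gap between the matched term positions. "black or white" matches as well. With a larger slop, the reversed order "white and black" can match too.

To build the correct query, you can use the following term attribute during query processing: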
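A minimal sketch of the empty stop word list, assuming the Lucene 7.x core classes `StandardAnalyzer` and `CharArraySet`. The class name `NoStopWordsDemo` and the helper `terms` are hypothetical and exist only to print what the analyzer produces. The same analyzer configuration would have to be passed to `IndexWriterConfig` as well, so that indexing and querying agree.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoStopWordsDemo {

    // Hypothetical helper: collects the terms an analyzer emits for a piece of text.
    static List<String> terms(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("source", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // An empty stop word set keeps words such as "of" and "an" in the token stream.
        try (Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET)) {
            System.out.println(terms(analyzer, "benefit of an individual"));
        }
    }
}
```

With no stop words removed, consecutive query terms sit at consecutive positions, so a phrase query built without explicit positions or slop lines up with the index.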
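The two strategies can be illustrated with a toy position model in plain Java. This is a simplified sketch, not Lucene's scorer: it assumes each term occurs at most once per text, uses 0-based positions (so "black" is at 0 and "white" at 2), and the names `positions` and `matches` are hypothetical. A match requires that the spread of "document position minus query position" across all terms is at most the slop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PhrasePositionDemo {

    // Hypothetical stop word list for this sketch.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "or"));

    // Toy analysis: stop words are dropped but still consume a position.
    static Map<String, Integer> positions(String text) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = 0;
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                result.put(token, position);
            }
            position++;
        }
        return result;
    }

    // Simplified sloppy match: shift each document position by the term's
    // query position; the spread of the shifted values must not exceed the slop.
    static boolean matches(Map<String, Integer> doc, String[] terms, int[] queryPositions, int slop) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < terms.length; i++) {
            Integer docPosition = doc.get(terms[i]);
            if (docPosition == null) {
                return false; // every term of the phrase must be present
            }
            int shifted = docPosition - queryPositions[i];
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min <= slop;
    }

    public static void main(String[] args) {
        String[] terms = {"black", "white"};
        Map<String, Integer> doc = positions("black and white");
        System.out.println("naive, no slop:      " + matches(doc, terms, new int[] {0, 1}, 0));
        System.out.println("gap in positions:    " + matches(doc, terms, new int[] {0, 2}, 0));
        System.out.println("adjacent with slop 1: " + matches(doc, terms, new int[] {0, 1}, 1));
    }
}
```

In this model, the naive query fails on "black and white", while either leaving a gap in the query positions or allowing a slop of 1 makes it match.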

PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);

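I used setSlop(2) to keep the snippet simple; you can instead set the slop factor based on the query length, or put the correct term positions into the phrase builder. My recommendation is not to use stop words; you can read about stop words here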
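As an alternative to a fixed slop, the option of putting the correct term positions into the phrase builder might look like the following sketch. It assumes the Lucene 7.x API (`PositionIncrementAttribute`, `PhraseQuery.Builder.add(Term, int)`); the class name `PositionAwarePhraseQuery`, the helper `build`, and the two-word stop list in `main` are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PositionAwarePhraseQuery {

    // Hypothetical helper: builds a phrase query whose term positions keep
    // the gaps left behind by removed stop words.
    static PhraseQuery build(String field, String text, Analyzer analyzer) throws IOException {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute increment = stream.addAttribute(PositionIncrementAttribute.class);
            stream.reset();
            int position = -1;
            while (stream.incrementToken()) {
                // An increment greater than 1 means stop words were skipped before this token.
                position += increment.getPositionIncrement();
                builder.add(new Term(field, term.toString()), position);
            }
            stream.end();
        }
        return builder.build();
    }

    public static void main(String[] args) throws IOException {
        // Explicit stop word set (an assumption) so the sketch does not rely on analyzer defaults.
        CharArraySet stopWords = new CharArraySet(Arrays.asList("of", "an"), true);
        try (Analyzer analyzer = new StandardAnalyzer(stopWords)) {
            PhraseQuery query = build("source", "benefit of an individual", analyzer);
            System.out.println(query);
        }
    }
}
```

Because the query positions then mirror the gaps in the index, no slop is needed for this case. This only holds if the index was built with the same analyzer configuration.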