Get matched terms from a Lucene query

Asked: 2011-10-25 21:35:08

Tags: lucene

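Given a Lucene search query such as +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched a given document? I don't care where they matched or how many times; I only need to know whether they matched.

The goal is to take the initial query ("A B C"), remove the terms that matched (A and B), and then do further processing on the remainder (C).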
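The bookkeeping described here (start from the query terms, drop the ones that matched, keep the rest) can be sketched in plain Java without Lucene. This is only a minimal sketch: the class RemainingTerms and the method remainder are hypothetical names, and the set of matched terms is assumed to have been collected already by some per-document check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemainingTerms {
    // Returns the query terms that did not match, preserving query order.
    // Hypothetical helper: "matched" is assumed to come from a per-document check.
    static Set<String> remainder(List<String> queryTerms, Set<String> matched) {
        Set<String> rest = new LinkedHashSet<>(queryTerms);
        rest.removeAll(matched);
        return rest;
    }

    public static void main(String[] args) {
        Set<String> matched = new HashSet<>(Arrays.asList("A", "B"));
        Set<String> rest = remainder(Arrays.asList("A", "B", "C"), matched);
        System.out.println(rest); // only C is left for further processing
    }
}
```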

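5 Answers:

Answer 0: (score: 10)

Although the sample is in C#, the Lucene API is very similar (apart from some casing differences). I don't think translating it to Java would be hard.

Here is the usage: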

List<Term> terms = new List<Term>();    // will be filled with non-matched terms
List<Term> hitTerms = new List<Term>(); // will be filled with matched terms
GetHitTerms(query, searcher, docId, hitTerms, terms);

Here is the method:

void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest)
{
    // Leaf case: a single term either matches the document or it does not.
    if (query is TermQuery)
    {
        if (searcher.Explain(query, docId).IsMatch())
            hitTerms.Add((query as TermQuery).GetTerm());
        else
            rest.Add((query as TermQuery).GetTerm());
        return;
    }

    // Boolean query: recurse into every clause.
    if (query is BooleanQuery)
    {
        BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
        if (clauses == null) return;

        foreach (BooleanClause bc in clauses)
        {
            GetHitTerms(bc.GetQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    // Multi-term query (wildcard, prefix, ...): rewrite it into a BooleanQuery, then recurse.
    if (query is MultiTermQuery)
    {
        if (!(query is FuzzyQuery)) // FuzzyQuery doesn't support SetRewriteMethod
            (query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId, hitTerms, rest);
    }
}

Answer 1: (score: 1)

Following the answer given by @L.B, here is the code converted to Java, which works for me:

void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest) throws IOException
{
    // Leaf case: a single term either matches the document or it does not.
    if (query instanceof TermQuery)
    {
        if (searcher.explain(query, docId).isMatch())
            hitTerms.add(((TermQuery) query).getTerm());
        else
            rest.add(((TermQuery) query).getTerm());
        return;
    }

    // Boolean query: recurse into every clause.
    if (query instanceof BooleanQuery)
    {
        for (BooleanClause clause : (BooleanQuery) query) {
            GetHitTerms(clause.getQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    // Multi-term query: rewrite it into a BooleanQuery, then recurse.
    if (query instanceof MultiTermQuery)
    {
        if (!(query instanceof FuzzyQuery)) // FuzzyQuery doesn't support setRewriteMethod
            ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId, hitTerms, rest);
    }
}
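The shape of this recursion can be illustrated without a Lucene dependency. In the sketch below, Node, TermNode and BoolNode are hypothetical stand-ins for Query, TermQuery and BooleanQuery, and a plain set of the document's terms replaces the explain call. It only shows how the traversal sorts leaf terms into hits and rest; it is not Lucene's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryTreeWalk {
    // Hypothetical stand-in for Lucene's Query.
    interface Node {}

    // Hypothetical stand-in for TermQuery.
    static final class TermNode implements Node {
        final String term;
        TermNode(String term) { this.term = term; }
    }

    // Hypothetical stand-in for BooleanQuery.
    static final class BoolNode implements Node {
        final List<Node> clauses;
        BoolNode(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Walks the tree; leaf terms present in docTerms go to hits, the others to rest.
    static void collect(Node node, Set<String> docTerms, List<String> hits, List<String> rest) {
        if (node instanceof TermNode) {
            String term = ((TermNode) node).term;
            (docTerms.contains(term) ? hits : rest).add(term);
        } else if (node instanceof BoolNode) {
            for (Node clause : ((BoolNode) node).clauses) {
                collect(clause, docTerms, hits, rest);
            }
        }
    }

    public static void main(String[] args) {
        Node query = new BoolNode(
            new BoolNode(new TermNode("letter:A"), new TermNode("letter:B"), new TermNode("letter:C")),
            new BoolNode(new TermNode("style:Capital")));
        Set<String> docTerms = new HashSet<>(Arrays.asList("letter:A", "letter:B", "style:Capital"));
        List<String> hits = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        collect(query, docTerms, hits, rest);
        System.out.println("hits=" + hits + " rest=" + rest);
    }
}
```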

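Answer 2: (score: 1)

I basically use the same approach as @L.B, but updated it for the latest Lucene version, 7.4.0. Note: FuzzyQuery now supports .setRewriteMethod (which is why I removed the if).

I also added handling for BoostQuery, and I store the words Lucene found in a HashSet of strings rather than as Terms, to avoid duplicates.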

private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
    int docId, HashSet<String> hitWords) throws IOException {
  // Leaf case: record the term text if this term matches the document.
  if (query instanceof TermQuery) {
    if (indexSearcher.explain(query, docId).isMatch()) {
      // text() returns the term without the field name and, unlike
      // toString().split(":")[1], is safe when the text itself contains ':'.
      hitWords.add(((TermQuery) query).getTerm().text());
    }
  }

  // Boolean query: recurse into every clause.
  if (query instanceof BooleanQuery) {
    for (BooleanClause clause : (BooleanQuery) query) {
      saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
    }
  }

  // Multi-term query: rewrite it into a BooleanQuery, then recurse.
  if (query instanceof MultiTermQuery) {
    ((MultiTermQuery) query)
        .setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
    saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
        indexSearcher, docId, hitWords);
  }

  // Boost wrapper: unwrap it and recurse into the inner query.
  if (query instanceof BoostQuery) {
    saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
        hitWords);
  }
}

Answer 3: (score: 1)

Here is a simplified, non-recursive version for Lucene.NET 4.8.
I haven't verified it, but it should also work on Lucene.NET 3.x.

IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
{
    //Rewrite query into simpler internal form, required for ExtractTerms
    var simplifiedQuery = query.Rewrite(searcher.IndexReader);
    HashSet<Term> queryTerms = new HashSet<Term>();
    simplifiedQuery.ExtractTerms(queryTerms);

    List<Term> hitTerms = new List<Term>();
    foreach (var term in queryTerms)
    {
        var termQuery = new TermQuery(term);

        var explanation = searcher.Explain(termQuery, docId);
        if (explanation.IsMatch)
        {
            hitTerms.Add(term);
        }
    }
    return hitTerms;
}

Answer 4: (score: 0)

You can use a cached filter for each individual term and quickly check each document ID against their BitSets.
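The lookup side of this idea can be sketched with java.util.BitSet alone. This is not Lucene's filter API: the map termToDocs below merely stands in for the cached per-term bit sets that such filters would produce, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermBitSets {
    // Returns the terms whose bit set has the bit for docId set.
    static List<String> matchedTerms(Map<String, BitSet> termToDocs, int docId) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : termToDocs.entrySet()) {
            if (e.getValue().get(docId)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Builds a bit set with the given document IDs set
    // (assumed stand-in for a cached filter result).
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, BitSet> termToDocs = new LinkedHashMap<>();
        termToDocs.put("A", docs(0, 2));
        termToDocs.put("B", docs(2, 5));
        termToDocs.put("C", docs(7));
        System.out.println(matchedTerms(termToDocs, 2)); // terms A and B match document 2
    }
}
```

Each check is a single bit lookup per term, so the cost per document grows with the number of query terms rather than with the size of the index.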