How does Elasticsearch's More Like This work (against the whole index)?

Time: 2015-06-26 10:25:53

Tags: indexing elasticsearch lucene comparison morelikethis

First we get a list of term vectors, which contain all the tokens, and build a map<token, frequency in the document>. The createQueue method then scores each token: it drops stop words and words that do not occur often enough, computes the idf, and multiplies it by the token's frequency in the document to get the token's score. Only the 25 best tokens are kept. But how does it work after that? How is the document compared to the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ but it didn't explain this, or I missed the point.

1 answer:

Answer 0 (score: 1)

It creates a TermQuery from each term and puts them all into a simple BooleanQuery, boosting each term by the tf-idf score computed earlier (boostFactor * myScore / bestScore, where boostFactor can be set by the user).

Here is the source (version 5.0):

private Query createQuery(PriorityQueue<ScoreTerm> q) {
  BooleanQuery query = new BooleanQuery();
  ScoreTerm scoreTerm;
  float bestScore = -1;

  while ((scoreTerm = q.pop()) != null) {
    TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));

    if (boost) {
      if (bestScore == -1) {
        bestScore = (scoreTerm.score);
      }
      float myScore = (scoreTerm.score);
      tq.setBoost(boostFactor * myScore / bestScore);
    }

    try {
      query.add(tq, BooleanClause.Occur.SHOULD);
    }
    catch (BooleanQuery.TooManyClauses ignore) {
      break;
    }
  }
  return query;
}
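The returned query is an ordinary query: it gets executed against the index like any other search, and since every clause is SHOULD, a document only needs to contain at least one of the selected terms to match. To make the whole flow concrete, here is a minimal sketch in plain Java without any Lucene dependency. The class MltSketch and its methods are hypothetical names made up for illustration, the idf formula is an assumed classic tf-idf variant, taking the top-scoring term as bestScore is an assumption based on the variable name, and matchScore is only a crude stand-in for real Lucene scoring (which also looks at term statistics in each candidate document).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene or Elasticsearch.
class MltSketch {

    // Assumed idf formula (classic tf-idf style); the real similarity may differ.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Scores each term as tf * idf, keeps the maxQueryTerms best ones, and
    // returns term -> boost (boostFactor * myScore / bestScore), best first.
    static Map<String, Double> selectTerms(Map<String, Integer> termFreqs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs, int maxQueryTerms,
                                           double boostFactor) {
        List<String> terms = new ArrayList<>();
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df == 0) continue; // term unknown to the index: skip it
            scores.put(e.getKey(), e.getValue() * idf(df, numDocs));
            terms.add(e.getKey());
        }
        // Highest score first.
        terms.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));

        Map<String, Double> boosts = new LinkedHashMap<>();
        if (terms.isEmpty()) return boosts;
        // Assumption: bestScore is the score of the top-ranked term.
        double bestScore = scores.get(terms.get(0));
        int limit = Math.min(maxQueryTerms, terms.size());
        for (String t : terms.subList(0, limit)) {
            boosts.put(t, boostFactor * scores.get(t) / bestScore);
        }
        return boosts;
    }

    // Crude stand-in for running the SHOULD query against one indexed
    // document: sum the boosts of the query terms the document contains.
    static double matchScore(Map<String, Double> boosts, Set<String> docTerms) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (docTerms.contains(e.getKey())) sum += e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Term frequencies in the source document (made-up numbers).
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("lucene", 3);
        tf.put("search", 2);
        tf.put("the", 4);
        // Document frequencies in a made-up index of 1000 documents.
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("lucene", 9);
        df.put("search", 99);
        df.put("the", 999);

        Map<String, Double> boosts = selectTerms(tf, df, 1000, 2, 1.0);
        System.out.println("query terms and boosts: " + boosts);

        Set<String> candidate = new HashSet<>(Arrays.asList("lucene", "index"));
        System.out.println("candidate score: " + matchScore(boosts, candidate));
    }
}
```

In this sketch a document sharing more of the selected terms, or the higher-boosted ones, ends up with a higher score, which is the sense in which the source document is "compared" to everything in the index.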