I need a way to score Lucene documents using term frequency only. Is there a flag I need to change?

Time: 2016-04-07 11:21:45

Tags: lucene

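Suppose I have two documents: D1 contains the term "lucene" twice, and D2 contains the term "lucene" three times. I want Lucene to score D2 higher than D1. Note that D1 has only two words (i.e. "lucene lucene"), while D2 has 100 words, 3 of which are "lucene". The default Lucene scoring model will score D1 higher than D2. I want to disable this behavior and rank D2 above D1. This is a requirement of my project.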

1 answer:

Answer 0 (score: 3)

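You need to implement a Similarity that does what you want. You could implement Similarity directly, but you will probably find it simpler to copy ClassicSimilarity (DefaultSimilarity in versions before 5.4) and strip out everything you don't want to affect your score (i.e. return a constant instead). For example, here is a very simple implementation that scores only on the frequency of the query terms in the document: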

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class SimpleSimilarity extends TFIDFSimilarity {
//Comments describe briefly what these methods do in the *standard* implementation.
//Not what they do in this implementation (which, for most of them, is nothing at all)

  public SimpleSimilarity() {}

  //boosts results which match more query terms
  @Override
  public float coord(int overlap, int maxOverlap) {
    return 1f;
  }

  //constant per query, normalizes scores somewhat based on query
  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return 1f;
  }

  //Norms should be disabled when using this similarity
  //They are useless to it, and would just be wasted space.
  @Override
  public final long encodeNormValue(float f) {
    return 1L;
  }

  @Override
  public final float decodeNormValue(long norm) {
    return 1f;
  }

  //Weighs shorter fields more heavily
  @Override
  public float lengthNorm(FieldInvertState state) {
    return 1f;
  }

  //Higher frequency terms (more matches) scored higher
  @Override
  public float tf(float freq) {
    //return (float)Math.sqrt(freq);  The standard tf impl
    return freq;
  }

  //Scores closer matches higher when using a sloppy phrase query
  @Override
  public float sloppyFreq(int distance) {
    return 1.0f;
  }

  //ClassicSimilarity doesn't really do much with payloads.  This is unmodified
  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    return 1f;
  }

  //Weigh matches on rarer terms more heavily.
  @Override
  public float idf(long docFreq, long numDocs) {
    return 1f;
  }

  @Override
  public String toString() {
    return "SimpleSimilarity";
  }
}
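
To see why this changes the ranking in the question, here is a rough standalone sketch of the arithmetic. It does not use Lucene at all. The class and method names are made up for illustration, and it assumes the classic formulas tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms), with boosts of 1, the same idf for both documents, and no lossy norm encoding.

```java
// Hypothetical demo class, not part of Lucene: it only mimics the two
// factors that differ between D1 and D2 in the question.
public class TfScoringDemo {

    // Classic-style score: sqrt(freq) weighted by 1/sqrt(field length).
    // idf, coord and queryNorm are left out because they are the same
    // for both documents in this single-term example.
    public static double classicScore(int freq, int numTerms) {
        double tf = Math.sqrt(freq);
        double lengthNorm = 1.0 / Math.sqrt(numTerms);
        return tf * lengthNorm;
    }

    // What SimpleSimilarity above boils down to for one term:
    // every other factor is 1, so the score is the raw frequency.
    public static double rawTfScore(int freq) {
        return freq;
    }

    public static void main(String[] args) {
        // D1: 2 words, both "lucene". D2: 100 words, 3 of them "lucene".
        System.out.printf("classic: D1=%.3f D2=%.3f%n",
                classicScore(2, 2), classicScore(3, 100));
        System.out.printf("raw tf:  D1=%.1f D2=%.1f%n",
                rawTfScore(2), rawTfScore(3));
    }
}
```

With the classic-style formula D1 comes out around 1.0 and D2 around 0.17, so D1 wins because of length normalization. With raw term frequency the scores are 2 and 3, so D2 ranks first. To actually use a custom similarity in Lucene, it is normally set with setSimilarity on both the IndexWriterConfig (at index time) and the IndexSearcher (at search time).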