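Suppose I have two documents: D1 contains the term "lucene" twice and D2 contains it three times. I want Lucene to score D2 higher than D1. Note that D1 consists of only two words (i.e. "lucene lucene"), while D2 has 100 words, 3 of which are "lucene". The default Lucene scoring model scores D1 higher than D2. I want to disable this behavior and rank D2 above D1. This is a requirement of my project.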
Answer 0 (score: 3)
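You need to implement a Similarity that does what you want. You could implement Similarity directly, but you will probably find it simpler to copy ClassicSimilarity (named DefaultSimilarity before version 5.4) and strip out everything you don't want to affect your score (i.e. make it return a constant). For example, here is a very simple implementation that just returns the frequency of the query term in the document: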
    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.TFIDFSimilarity;
    import org.apache.lucene.util.BytesRef;

    // Note: these overrides match the TFIDFSimilarity API of the Lucene versions
    // this answer was written for; later releases changed the set of abstract methods.
    public class SimpleSimilarity extends TFIDFSimilarity {

        // Comments briefly describe what these methods do in the *standard* implementation,
        // not what they do in this implementation (which, for most of them, is nothing at all).

        public SimpleSimilarity() {}

        // Boosts results which match more query terms
        @Override
        public float coord(int overlap, int maxOverlap) {
            return 1f;
        }

        // Constant per query, normalizes scores somewhat based on the query
        @Override
        public float queryNorm(float sumOfSquaredWeights) {
            return 1f;
        }

        // Norms should be disabled when using this similarity.
        // They are useless to it, and would just be wasted space.
        @Override
        public final long encodeNormValue(float f) {
            return 1L;
        }

        @Override
        public final float decodeNormValue(long norm) {
            return 1f;
        }

        // Weighs shorter fields more heavily
        @Override
        public float lengthNorm(FieldInvertState state) {
            return 1f;
        }

        // Higher frequency terms (more matches) are scored higher
        @Override
        public float tf(float freq) {
            // return (float) Math.sqrt(freq); // the standard tf implementation
            return freq;
        }

        // Scores closer matches higher when using a sloppy phrase query
        @Override
        public float sloppyFreq(int distance) {
            return 1.0f;
        }

        // ClassicSimilarity doesn't really do much with payloads. This is unmodified.
        @Override
        public float scorePayload(int doc, int start, int end, BytesRef payload) {
            return 1f;
        }

        // Weighs matches on rarer terms more heavily
        @Override
        public float idf(long docFreq, long numDocs) {
            return 1f;
        }

        @Override
        public String toString() {
            return "SimpleSimilarity";
        }
    }
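
To see why the default model prefers D1, and why returning the raw frequency reverses that, here is a rough sketch of the arithmetic in plain Java. It does not call Lucene at all: the class and method names are hypothetical, and the "classic-style" formula is a simplification that keeps only the square-root term frequency and the 1/sqrt(field length) normalization, ignoring idf, boosts, and the lossy encoding of norms.

```java
// Hypothetical illustration only: none of these names are part of the Lucene API.
final class TfScoringSketch {

    // Simplified classic-style score for a single-term query:
    // sqrt(termFreq) multiplied by the length norm 1/sqrt(fieldLength).
    // idf, boosts, and norm encoding are deliberately left out.
    static double classicStyleScore(int termFreq, int fieldLength) {
        return Math.sqrt(termFreq) / Math.sqrt(fieldLength);
    }

    // What a similarity with tf(freq) = freq and every other factor fixed at 1
    // reduces to for a single-term query: the raw term frequency.
    static double rawTfScore(int termFreq) {
        return termFreq;
    }

    public static void main(String[] args) {
        // D1: 2 of 2 words are "lucene"; D2: 3 of 100 words are "lucene".
        System.out.println("classic-style D1 = " + classicStyleScore(2, 2));
        System.out.println("classic-style D2 = " + classicStyleScore(3, 100));
        System.out.println("raw tf D1 = " + rawTfScore(2));
        System.out.println("raw tf D2 = " + rawTfScore(3));
    }
}
```

Under these assumptions the classic-style score is about 1.0 for D1 and about 0.17 for D2, so the length normalization dominates, whereas the raw frequency gives 2 for D1 and 3 for D2, which is the ordering the question asks for. For a custom similarity to take effect it has to be set on both sides, via `IndexWriterConfig.setSimilarity(...)` when indexing and `IndexSearcher.setSimilarity(...)` when searching.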