Question

Using norms at index time works well in general; my problem is that very short fields rank inappropriately high. For example:

doc1 : tf(200) out of 1,000
doc2 : tf(150) out of 500

doc2 will score higher, which is great.

The problem is when I have:

doc3 : tf(3) out of 4

This is not good in my case, because such a document is very rare; let's call it an exception.

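I have read that in KinoSearch, or somewhere similar, someone suggested introducing a constant to offset this problem. Any ideas on how to keep the benefits of norms while avoiding this issue?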

Thanks

Answer 1

You can create your own Similarity class that extends DefaultSimilarity and simply overrides the lengthNorm method. The default lengthNorm implementation is quite simple:

public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

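Replace it with whatever algorithm makes sense in your case. In practice, the last line is probably the only part you need to modify, in particular 1.0 / Math.sqrt(numTerms). Two things to keep in mind here: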

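Norms are compressed in a very lossy way (about 1 decimal digit of precision!) to save space. Large differences matter; small tweaks tend to get lost.
You will need to reindex. Norms are stored at index time, not computed at query time.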
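
To make the effect concrete, here is a minimal standalone sketch (no Lucene dependency) that compares the default length norm with one possible replacement: a floor on the field length. The class name, the helper names, and the MIN_LENGTH value of 100 are all hypothetical, and it assumes the default tf of sqrt(freq) and a boost of 1. It is one way to dampen short fields, not necessarily the right one for your data.

```java
// Standalone sketch: plain Java, so the numbers can be checked by hand.
public class LengthNormSketch {

    // Hypothetical floor: fields shorter than this are treated as if they had this many terms.
    static final int MIN_LENGTH = 100;

    // Same formula as the last line of the default lengthNorm (boost assumed to be 1).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Variant with a floor on the length, so very short fields get no extra credit.
    static float flooredNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, MIN_LENGTH)));
    }

    // Assumes tf = sqrt(freq); tf * norm is the length-dependent part of the score.
    static float partialScore(int freq, float norm) {
        return (float) Math.sqrt(freq) * norm;
    }

    public static void main(String[] args) {
        // {term frequency, field length} for doc1, doc2, doc3 from the question
        int[][] docs = { {200, 1000}, {150, 500}, {3, 4} };
        for (int i = 0; i < docs.length; i++) {
            int freq = docs[i][0];
            int len = docs[i][1];
            System.out.printf("doc%d default=%.3f floored=%.3f%n", i + 1,
                    partialScore(freq, defaultNorm(len)),
                    partialScore(freq, flooredNorm(len)));
        }
    }
}
```

With the default formula doc3 comes out on top; with the floor, doc1 and doc2 are unchanged and doc3 falls below both. Because the floor changes the norm of short fields by a large factor, it should survive the lossy compression mentioned above. In a real Similarity subclass, the floored expression would take the place of 1.0 / Math.sqrt(numTerms) in lengthNorm.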

You can set Solr to use your Similarity in the schema, for example:

<similarity class="this.is.my.CustomSimilarity"/>

lucene / solr norms: avoiding inappropriate ranking of short fields
