In Solr 4.*, suppose I have a field "mytext".
The first record's "mytext" value is "working at ABC".
The second record's "mytext" value is "working at ABC project ABC".
Now when I search for "Working at ABC", the documents come back in this order:
Document 1: "Working at ABC project ABC"
Document 2: "Working at ABC"
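By the scoring calculation this makes sense: the second record comes out on top because it contains "ABC" twice, so its term frequency (TF) is higher.
But from the user's point of view, when the query is "Working at ABC" the results should be:
"Working at ABC"
"Working at ABC project ABC"
How can I handle this situation? In this project, only "company" and "project" contain overlapping data, as with "ABC" in this case.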
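To make the term-frequency effect concrete, here is a minimal sketch that isolates the TF part of a classic TF-IDF score. It assumes tf = sqrt(freq), plain whitespace tokenization with lowercasing, and that "at" is not removed as a stopword; idf and length normalization are deliberately left out, so it is not Solr's full scoring formula. `TfOnlyDemo` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TfOnlyDemo {

    // Count how often `term` occurs in a lowercased, whitespace-tokenized text.
    public static int freq(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Sum of sqrt(freq) over the query terms: only the TF contribution,
    // with idf and length normalization intentionally ignored.
    public static double tfScore(String doc, List<String> queryTerms) {
        double score = 0.0;
        for (String term : queryTerms) {
            score += Math.sqrt(freq(doc, term));
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("working", "at", "abc");
        System.out.println(tfScore("Working at ABC", query));
        System.out.println(tfScore("Working at ABC project ABC", query));
    }
}
```

Under these assumptions the longer record scores higher purely because "abc" occurs twice in it, which is the behavior described above.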
Thanks,
Amit Aggarwal
Answer 0 (score: 0)
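You can set omitTermFreqAndPositions=true on the field in schema.xml. As long as norms are kept for that field, a field with shorter content will then rank higher than one with longer content. Note that this also drops position data, so phrase queries on that field will no longer work.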
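A rough sketch of what that does to the ranking, assuming each matching query term contributes a constant 1 (term frequency ignored) and the sum is scaled by a length norm of 1/sqrt(numTerms). It ignores idf and the lossy single-byte encoding of norms, and `OmitTfDemo` is a hypothetical name, so treat it as an illustration rather than Solr's actual formula.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OmitTfDemo {

    // Score with term frequency ignored: every query term that occurs in the
    // document counts once, and the total is scaled by 1/sqrt(field length).
    public static double score(String doc, String[] queryTerms) {
        String[] tokens = doc.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> distinct = new HashSet<>(Arrays.asList(tokens));
        int matched = 0;
        for (String term : queryTerms) {
            if (distinct.contains(term)) {
                matched++;
            }
        }
        return matched / Math.sqrt(tokens.length);
    }

    public static void main(String[] args) {
        String[] query = {"working", "at", "abc"};
        System.out.println(score("Working at ABC", query));
        System.out.println(score("Working at ABC project ABC", query));
    }
}
```

With the repeated "ABC" no longer rewarded, both records match the same three terms and the shorter one wins on the length norm alone.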
Answer 1 (score: 0)
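Instead of changing schema.xml, I overrode the TF function so that it always returns 1, which removes any effect of term frequency.
If anyone else is using Solr on short fields, here is my custom class: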
    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    // The class name and the DefaultSimilarity base class (Lucene 4.x) are
    // assumptions; the original post showed only the members below.
    public class ShortFieldSimilarity extends DefaultSimilarity {

        // Length norms for fields of 0 to 10 terms, indexed by term count.
        private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

        /**
         * Implemented as a lookup for the first 10 counts, then
         * <code>1/sqrt(numTerms)</code>. This is to avoid term counts below
         * 11 from having the same lengthNorm after being stored encoded as
         * a single byte.
         */
        @Override
        public float lengthNorm(FieldInvertState state) {
            int numTerms = state.getLength();
            if (numTerms <= 10) {
                // A negative count shouldn't be possible, but be safe.
                if (numTerms < 0) { numTerms = 0; }
                return ARR[numTerms];
            }
            return (float) (1.0 / Math.sqrt(numTerms));
        }

        // For short fields, term frequency does not always indicate relevance,
        // so always return 1.0.
        @Override
        public float tf(float freq) {
            return 1.0f;
        }
    }
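For readers who want to check the arithmetic without a Lucene dependency, the following standalone sketch copies the lookup table and the constant tf from the class above and applies them to the two records from the question. The `score` helper is a hypothetical simplification (sum of tf over matched terms, times the length norm) that ignores idf, query normalization, and norm encoding; `ShortFieldNormDemo` is likewise a made-up name.

```java
public class ShortFieldNormDemo {

    // Same table as in the answer: index = number of terms in the field.
    private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

    // Table lookup for up to 10 terms, 1/sqrt(numTerms) beyond that.
    public static float lengthNorm(int numTerms) {
        if (numTerms <= 10) {
            return ARR[Math.max(numTerms, 0)];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Term frequency is flattened to a constant, as in the answer.
    public static float tf(float freq) {
        return 1.0f;
    }

    // Hypothetical simplified score: sum of tf over the matched terms'
    // frequencies, multiplied by the length norm of the field.
    public static float score(int[] matchedFreqs, int numTerms) {
        float sum = 0.0f;
        for (int freq : matchedFreqs) {
            sum += tf(freq);
        }
        return sum * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // "working at ABC": 3 terms, each query term occurs once.
        System.out.println(score(new int[] {1, 1, 1}, 3));
        // "working at ABC project ABC": 5 terms, "ABC" occurs twice.
        System.out.println(score(new int[] {1, 1, 2}, 5));
    }
}
```

In this simplified model the three-term record gets a norm of 1.0 and the five-term record 0.75, so the shorter record ranks first even though "ABC" appears twice in the longer one.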