Question

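In Solr 4.*, suppose I have a field "mytext".

The first record in "mytext" is "working at ABC".
The second record in "mytext" is "working at ABC project ABC".

Now when I search for "Working at ABC", the documents come back in this order:

Doc 1: "Working at ABC project ABC"

Doc 2: "Working at ABC"

By the scoring calculation this makes sense: the second record ("working at ABC project ABC") ends up on top because it contains "ABC" twice, so its TF is higher.

But from the user's perspective, when the query is "Working at ABC", the results should be: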

"Working at ABC"

"Working at ABC project ABC"

How can I handle this case? It only arises where "company" and "project" have overlapping data, as with "ABC" here.

Thanks,

Amit Aggarwal

Answer 1

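You can set omitTermFreqAndPositions=true for the field. As long as norms are kept (omitNorms=false), fields with shorter content will rank higher than fields with longer content.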
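
To see why this helps, here is a minimal sketch of the two factors involved, assuming classic Lucene-style TF-IDF scoring in which tf is sqrt(freq) and the length norm is 1/sqrt(numTerms). The class and method names are hypothetical, tokenization is a plain lowercase whitespace split, and idf, coord, query norm, and the single-byte norm encoding are all left out, so the numbers only illustrate the trend and are not real Solr scores.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the tf and length-norm factors; not Solr's scorer.
public class TfNormSketch {

    // Sum of tf over the query terms that occur in the document.
    // withTf=true uses sqrt(freq); withTf=false counts each matching term once,
    // which models the effect of omitting term frequencies.
    public static double tfSum(String query, String doc, boolean withTf) {
        List<String> docTokens = Arrays.asList(doc.toLowerCase().trim().split("\\s+"));
        double sum = 0.0;
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            int freq = Collections.frequency(docTokens, term);
            if (freq > 0) {
                sum += withTf ? Math.sqrt(freq) : 1.0;
            }
        }
        return sum;
    }

    // Length norm before any encoding: shorter fields get a larger value.
    public static double lengthNorm(String doc) {
        return 1.0 / Math.sqrt(doc.trim().split("\\s+").length);
    }

    public static void main(String[] args) {
        String query = "Working at ABC";
        String[] docs = {"working at ABC", "working at ABC project ABC"};
        for (String doc : docs) {
            System.out.printf("%-28s tfSum(sqrt)=%.3f tfSum(omitted)=%.3f lengthNorm=%.3f%n",
                    doc, tfSum(query, doc, true), tfSum(query, doc, false), lengthNorm(doc));
        }
    }
}
```

In this model the repeated "ABC" raises the tf sum of the longer record from 3.0 to about 3.414. With term frequencies omitted both records tie at 3.0, so the length norm (about 0.577 versus 0.447) decides in favor of the shorter one. Whether the longer record actually wins in a real index also depends on the idf weights and on whether norms are enabled for the field. Keep in mind that omitting positions also means phrase queries on that field will no longer work.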

Answer 2

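Instead of changing schema.xml, I overrode the tf function so that it always returns 1, which removes any effect of term frequency.

In case anyone else is using Solr on short fields, here is my custom class: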

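import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// The class name is hypothetical; the answer listed only the members below.
public class ShortFieldSimilarity extends DefaultSimilarity {

  // Norm values for fields with 0 to 10 terms, indexed by term count.
  private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

  /**
   * Implemented as a lookup for the first 10 counts, then
   * <code>1/sqrt(numTerms)</code>. This is to avoid term counts below
   * 11 from having the same lengthNorm after being stored encoded as
   * a single byte.
   */
  @Override
  public float lengthNorm(FieldInvertState state) {
    int numTerms = state.getLength();
    String fieldName = state.getName();

    // Debug output; this runs for every indexed field, so remove it in production.
    System.out.println("field is " + fieldName + "   number of terms are  " + numTerms);
    if (numTerms <= 10) {
      // A negative count shouldn't be possible, but be safe.
      if (numTerms < 0) { numTerms = 0; }

      return ARR[numTerms];
    }
    // Longer fields fall back to the usual 1/sqrt(numTerms).
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  // For short fields, term frequency does not always indicate relevance, so tf is constant.
  @Override
  public float tf(float freq) {
    return 1.0f;
  }
}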
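
As a rough illustration of what this class does to the two records from the question, the sketch below applies the same lookup table outside Lucene. The class name and the `score` helper are hypothetical; idf, coord, query norm, and norm encoding are ignored, and each record is assumed to match all three query terms.

```java
// Hypothetical stand-alone model of the custom similarity above; not Lucene code.
public class ShortFieldNormSketch {

    // Same table as in the answer: the index is the number of terms in the field.
    private static final float[] ARR = {
        0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
    };

    // Table lookup for up to 10 terms, 1/sqrt(numTerms) beyond that.
    public static float lengthNorm(int numTerms) {
        if (numTerms <= 10) {
            return ARR[Math.max(numTerms, 0)];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // tf is the constant 1.0, so every matching query term contributes equally.
    public static float score(int matchingQueryTerms, int numTermsInField) {
        return matchingQueryTerms * 1.0f * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        System.out.println(score(3, 3)); // "working at ABC": prints 3.0
        System.out.println(score(3, 5)); // "working at ABC project ABC": prints 2.25
    }
}
```

With tf fixed at 1, the second "ABC" no longer adds anything, and the length norm alone puts the three-term record ahead of the five-term one.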

Solr: how to solve this use case
