Solr: how to solve this use case

Time: 2015-03-04 09:23:09

Tags: solr lucene lucene.net solr4

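In Solr 4.*, suppose I have a field "mytext".

  1. The first record in "mytext" is "working at ABC".

  2. The second record in "mytext" is "working at ABC project ABC".

  3. Now when I search for "Working at ABC", the documents come back in this order:

    Document 1: "Working at ABC project ABC"

    Document 2: "Working at ABC"

    By the scoring calculation this makes sense: the second record ("working at ABC project ABC") ends up on top because it contains "ABC" twice, so its TF is higher.

    But from the user's perspective, when the query is "Working at ABC" the results should be: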

    "Working at ABC"
    
    "Working at ABC project ABC"
    

    How should I handle this case? In this project, only the "company" and "project" data overlap, as "ABC" does here.

    Thanks,

    Amit Aggarwal

2 answers:

Answer 0: (score: 0)

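You can set omitTermFreqAndPositions=true on the field. As long as norms are kept (omitNorms=false), a field with shorter content will then rank higher than a field with longer content when both match the same query terms.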
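The effect described in this answer can be checked with some rough arithmetic. The sketch below is not Lucene code: it assumes the classic factors tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), tokenizes on whitespace, and leaves out idf, coord, queryNorm, boosts and the byte encoding of norms. The class and method names (`TfNormSketch`, `score`) are made up for illustration, so treat the numbers as indicative only.

```java
import java.util.Locale;

/** Simplified scoring arithmetic for illustration; not the Lucene implementation. */
class TfNormSketch {

    // Classic term-frequency factor: sqrt(freq).
    static float classicTf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Length norm without boosts or byte encoding: 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Lowercase and split on whitespace (a stand-in for a real analyzer).
    static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Sum the TF factor of every query term found in the document,
    // then optionally scale by the document's length norm.
    static float score(String query, String doc, boolean useTermFreq, boolean useNorms) {
        String[] docTokens = tokens(doc);
        float sum = 0.0f;
        for (String q : tokens(query)) {
            int freq = 0;
            for (String t : docTokens) {
                if (t.equals(q)) {
                    freq++;
                }
            }
            if (freq > 0) {
                // With term frequencies omitted, every match counts as 1.
                sum += useTermFreq ? classicTf(freq) : 1.0f;
            }
        }
        return useNorms ? sum * lengthNorm(docTokens.length) : sum;
    }

    public static void main(String[] args) {
        String query = "Working at ABC";
        String shortDoc = "working at ABC";
        String longDoc = "working at ABC project ABC";

        System.out.println("TF on,  norms off: short=" + score(query, shortDoc, true, false)
                + " long=" + score(query, longDoc, true, false));
        System.out.println("TF off, norms on:  short=" + score(query, shortDoc, false, true)
                + " long=" + score(query, longDoc, false, true));
        System.out.println("TF off, norms off: short=" + score(query, shortDoc, false, false)
                + " long=" + score(query, longDoc, false, false));
    }
}
```

Under these assumptions, the longer document wins when only term frequency is in play, the shorter document wins once term frequency is flattened and norms are kept, and the two tie when both are switched off. That tie is why the norms condition matters.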

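Answer 1: (score: 0)

Instead of changing schema.xml, I overrode the TF function so that it always returns 1. Term frequency therefore has no effect on the score.

In case anyone else is using Solr on short fields, here is my custom class: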

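import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// The original post omitted the class declaration. The class name is a placeholder,
// and DefaultSimilarity is the assumed base class (Lucene/Solr 4.x).
public class ShortFieldSimilarity extends DefaultSimilarity {

  // Norms for fields of 0 to 10 terms, indexed by term count.
  private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

  /** 
   * Implemented as a lookup for the first 10 counts, then
   * <code>1/sqrt(numTerms)</code>. This keeps term counts below
   * 11 from sharing the same lengthNorm after being stored encoded as
   * a single byte.
   */
  @Override
  public float lengthNorm(FieldInvertState state) {
    int numTerms = state.getLength();
    String fieldName = state.getName();

    // Debug output only; remove it before indexing real data.
    System.out.println("field is " + fieldName + "   number of terms are  " + numTerms);
    if (numTerms <= 10) {
      // A negative count shouldn't be possible, but be safe.
      if (numTerms < 0) { numTerms = 0; }

      return ARR[numTerms];
    }
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  // For short fields, term frequency does not always indicate relevance, so always return 1.0.
  @Override
  public float tf(float freq) {
    return 1.0f;
  }
}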
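
The Javadoc above says the lookup table exists because nearby norms can collapse once they are stored in a single byte. The sketch below tries to make that visible. It is a stand-alone re-implementation for illustration, modeled on the 3-mantissa-bit, zero-exponent-15 encoding that Lucene 4.x is understood to use for norms. The class name `NormByteSketch` and its methods are hypothetical, so check the behavior against your own Lucene version before relying on it.

```java
/** Illustration only: a small-float byte encoding modeled on Lucene 4.x norms. */
class NormByteSketch {

    // Same values as the ARR table in the answer, indexed by term count.
    static final float[] LOOKUP = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
                                    0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

    // Compress a float into one byte by keeping only its top bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            // Underflow: zero/negative becomes 0, tiny positives become 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zero + 0x100) {
            return -1; // Overflow: clamp to the largest value.
        }
        return (byte) (small - zero);
    }

    // Expand the byte back into the float it represents.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            float plain = (float) (1.0 / Math.sqrt(n));
            System.out.printf("%2d terms: 1/sqrt=%.4f stored as %.4f | lookup=%.4f stored as %.4f%n",
                    n, plain, decode(encode(plain)), LOOKUP[n], decode(encode(LOOKUP[n])));
        }
    }
}
```

In this sketch the plain 1/sqrt(numTerms) values for 3 and 4 terms decode to the same stored norm, as do those for 8, 9 and 10 terms, while each lookup value survives the round trip unchanged and stays distinct from its neighbors.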