Question

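As I understand it, the field length for a given document is the number of terms indexed in that field of the document. However, the field length never seems to be an integer. For example, I have a document with two terms in the content field, yet the field length Solr reports for content is 2.56 rather than the 2 I expected. How is field length calculated in Solr/Lucene?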

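I am referring to the field length used when computing the score with the BM25 similarity function, though I assume field length is computed for other ranking schemes as well.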

Answer 1

Here is what I see in the code of BM25Similarity:

  public final long computeNorm(FieldInvertState state) {
    final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
    return encodeNormValue(state.getBoost(), numTerms);
  }

where state#getLength() is:

  /**
   * Get total number of terms in this field.
   * @return the length
   */
  public int getLength() {
    return length;
  }

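So it is in fact an integer. Can you tell me where you see the non-integer value? In the Solr Admin UI? Where exactly?

Now that you have posted the output, I found where it comes from: source

Take a look at this: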

private Explanation explainTFNorm(int doc, Explanation freq, BM25Stats stats, NumericDocValues norms) {
    List<Explanation> subs = new ArrayList<>();
    subs.add(freq);
    subs.add(Explanation.match(k1, "parameter k1"));
    if (norms == null) {
      subs.add(Explanation.match(0, "parameter b (norms omitted for field)"));
      return Explanation.match(
          (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1),
          "tfNorm, computed from:", subs);
    } else {
      float doclen = decodeNormValue((byte)norms.get(doc));
      subs.add(Explanation.match(b, "parameter b"));
      subs.add(Explanation.match(stats.avgdl, "avgFieldLength"));
      subs.add(Explanation.match(doclen, "fieldLength"));
      return Explanation.match(
          (freq.getValue() * (k1 + 1)) / (freq.getValue() + k1 * (1 - b + b * doclen/stats.avgdl)),
          "tfNorm, computed from:", subs);
    }
  }
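To make the tfNorm expression above concrete, here is a minimal sketch that evaluates the same formula outside Lucene. The class name TfNormDemo and the sample numbers are made up for illustration; k1 = 1.2 and b = 0.75 are the BM25Similarity defaults.

```java
// Hypothetical stand-alone helper; it only mirrors the arithmetic in explainTFNorm.
class TfNormDemo {

    // Same expression as in the "else" branch of explainTFNorm.
    static float tfNorm(float freq, float k1, float b, float docLen, float avgDocLen) {
        return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * docLen / avgDocLen));
    }

    public static void main(String[] args) {
        float k1 = 1.2f;   // BM25Similarity default
        float b = 0.75f;   // BM25Similarity default

        // A document whose decoded length equals the average length.
        System.out.println(tfNorm(1f, k1, b, 2.56f, 2.56f));

        // The same term frequency in a longer-than-average document scores lower.
        System.out.println(tfNorm(1f, k1, b, 10.24f, 2.56f));
    }
}
```

Note that docLen here is the decoded norm, not the raw term count, which is why the explain output shows values such as 2.56.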

So the value printed as fieldLength is: float doclen = decodeNormValue((byte)norms.get(doc));

 /** The default implementation returns <code>1 / f<sup>2</sup></code>
   * where <code>f</code> is {@link SmallFloat#byte315ToFloat(byte)}. */
  protected float decodeNormValue(byte b) {
    return NORM_TABLE[b & 0xFF];
  }

/** Cache of decoded bytes. */
  private static final float[] NORM_TABLE = new float[256];

  static {
    for (int i = 1; i < 256; i++) {
      float f = SmallFloat.byte315ToFloat((byte)i);
      NORM_TABLE[i] = 1.0f / (f*f);
    }
    NORM_TABLE[0] = 1.0f / NORM_TABLE[255]; // otherwise inf
  }
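To see how an integer length of 2 turns into 2.56, the round trip can be reproduced without Lucene on the classpath. The sketch below re-implements the byte encoding of SmallFloat.floatToByte315 / byte315ToFloat from memory and assumes that the norm is stored as 1 / sqrt(numTerms) with a boost of 1.0, as in the BM25Similarity version quoted here; the class name FieldLengthDemo and the method reportedFieldLength are hypothetical.

```java
// Stand-alone sketch; the two byte315 methods mirror the logic of Lucene's SmallFloat.
class FieldLengthDemo {

    // Float -> single byte (assumed equivalent to SmallFloat.floatToByte315).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Single byte -> float (assumed equivalent to SmallFloat.byte315ToFloat).
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Encode the length at index time, then decode it the way NORM_TABLE does.
    static float reportedFieldLength(int numTerms) {
        byte norm = floatToByte315(1.0f / (float) Math.sqrt(numTerms)); // boost assumed to be 1.0
        float f = byte315ToFloat(norm);
        return 1.0f / (f * f);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + " terms -> fieldLength " + reportedFieldLength(n));
        }
    }
}
```

Under these assumptions, a field with two terms stores 1/sqrt(2) ≈ 0.707, which the byte encoding rounds down to 0.625, and 1 / 0.625² = 2.56. Lengths 3 and 4 both decode to 4.0, so the reported value is a bucketed approximation of the real term count.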

In fact, according to Wikipedia, this docLen should be

|D| is the length of the document D in words

Answer 2

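The "fieldLength" from the previous answer is computed through a fairly involved encode/decode normalization, implemented in SmallFloat.java, which essentially compresses a 32-bit value into 8 bits to save disk space when the data is stored.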

Here is the description of the decodeNormValue() function, which computes fieldLength in BM25:

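The default scoring implementation which {@link #encodeNormValue(float) encodes} norm values as a single byte before being stored. At search time, the norm byte value is read from the index {@link org.apache.lucene.store.Directory directory} and {@link #decodeNormValue(long) decoded} back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.875.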
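The precision loss in that example can be illustrated with a few lines of bit manipulation. This is only a sketch of the effect, not Lucene's code: it assumes that, for values inside the byte's representable range, the round trip amounts to clearing the low 21 bits of the IEEE 754 float. The class name NormPrecisionDemo and the method truncate are hypothetical.

```java
// Hypothetical illustration of the rounding; not a substitute for SmallFloat.
class NormPrecisionDemo {

    // Keep sign, exponent and the top two mantissa bits; drop the remaining 21 bits.
    static float truncate(float x) {
        int bits = Float.floatToRawIntBits(x);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    public static void main(String[] args) {
        System.out.println(truncate(0.89f));        // prints 0.875
        System.out.println(truncate(0.70710677f));  // 1/sqrt(2), prints 0.625
    }
}
```

Since 1 / 0.625² = 2.56, the same rounding accounts for the fieldLength of 2.56 reported for a two-term field.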

Hope this helps.
