Question

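I am using BaseTokenStreamTestCase to run some tests on a custom TokenFilter.

The test is failing in a way I can't explain. You can see from my debug output that the token it is complaining about has an endOffset of 17 ...

inconsistent endOffset 1 pos=1 posLen=1 token=hello expected:<11> but was:<17>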

   original: wheel chair hello there foo bar
  increment:      1        1     1      1   
     tokens: wheel chair hello there foo bar
  positions: ----------- ----- ----- -------
    lengths:      2        1     1      2   
   sequence:      1        2     3      4   
             0123456789012345678901234567890
                      10        20        30
  start-end: 1:[0-11], 2:[12-17], 3:[18-23], 4:[24-31]

The test code is below:

assertAnalyzesTo(analyzer, input,
        new String[] {"wheel chair", "hello", "there", "foo bar"},
        new int[] {0, 12, 18, 24},  // start offsets
        new int[] {11, 17, 23, 31}, // end offsets
        null,                       // types
        new int[] {1, 1, 1, 1},     // positionIncrement
        new int[] {2, 1, 1, 2});    // positionLength

Why does it think the second token should end at {{1}}?

Answer 1

BaseTokenStreamTestCase raises the error from this source: ...... around line 248

  final int endPos = pos + posLength;

  if (!posToEndOffset.containsKey(endPos)) {
    // First time we've seen a token arriving to this position:
    posToEndOffset.put(endPos, endOffset);
    //System.out.println("  + e " + endPos + " -> " + endOffset);
  } else {
    // We've seen a token arriving to this position
    // before; verify the endOffset is the same:
    //System.out.println("  + ve " + endPos + " -> " + endOffset);
    assertEquals("inconsistent endOffset " + i + " pos=" + pos + " posLen=" + posLength + " token=" + termAtt, posToEndOffset.get(endPos).intValue(), endOffset);
  }

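Because endPos is computed as pos + posLength, the test assumes that posToEndOffset.get(endPos) returns the end offset for the position at which the current token ends (its position plus its length).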

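This means the check looks one token ahead, because the first token has length = 2. The first token (pos=0, posLen=2) records end offset 11 for position 2; the second token (pos=1, posLen=1) also arrives at position 2, but with end offset 17. That mismatch is why the test fails: the length was being used improperly.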

Setting the length attribute to its default value of 1 fixed the test error.
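
The bookkeeping described in this answer can be reproduced without Lucene. The following is only a minimal sketch of the end-offset check quoted above, not the actual BaseTokenStreamTestCase code: the class name EndOffsetCheckSketch and the method firstInconsistentToken are hypothetical, and it assumes the position counter starts at -1 before the first increment is applied, which matches the pos=1 reported for the second token in the error message.

```java
import java.util.HashMap;
import java.util.Map;

public class EndOffsetCheckSketch {

    /**
     * Hypothetical helper mirroring the quoted check: returns the index of the
     * first token whose end offset disagrees with an earlier token that ended
     * at the same position, or -1 if every token is consistent.
     */
    static int firstInconsistentToken(int[] posIncs, int[] posLengths, int[] endOffsets) {
        Map<Integer, Integer> posToEndOffset = new HashMap<>();
        int pos = -1; // assumption: counter starts before the first token
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];
            int endPos = pos + posLengths[i];
            // Record the end offset the first time a token arrives at endPos;
            // otherwise compare against what was recorded earlier.
            Integer seen = posToEndOffset.putIfAbsent(endPos, endOffsets[i]);
            if (seen != null && seen != endOffsets[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] posIncs = {1, 1, 1, 1};
        int[] endOffsets = {11, 17, 23, 31};

        // Values from the question: "wheel chair" has posLength 2, so it and
        // "hello" both end at position 2 with different end offsets (11 vs 17).
        System.out.println(firstInconsistentToken(posIncs, new int[] {2, 1, 1, 2}, endOffsets));

        // With every posLength at the default of 1, each token ends at its own position.
        System.out.println(firstInconsistentToken(posIncs, new int[] {1, 1, 1, 1}, endOffsets));
    }
}
```

With the values from the question, the first call reports token index 1 ("hello"), the same token named in the failure message; with all position lengths set to 1, no conflict is found and the method returns -1.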

Lucene / Solr test: inconsistent endOffset
