Question

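I wrote my own tokenizer: https://github.com/AdiGabaie/tokenizer

I created an analyzer that uses this tokenizer.

When I test the analyzer, I see the tokens, but 'start_offset' and 'end_offset' are 0 for every token, and the position of every token is 1.

If I remove 'autocomplete_filter', the positions are fine (1, 2, 3, ...), but 'start_offset' and 'end_offset' are still 0.

I assume there is something I should do in my tokenizer implementation to fix this?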

PUT /aditryings/
{
    "settings": {
        "index" : {
            "analysis" : { 
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "phrase_tokenizer",
                        "filter" : ["lowercase","autocomplete_filter"]
                    }
                },
                "filter" : {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 20
                    }
                }
            }
        }
    }, 
    "mappings" : {
        "productes" : {
            "properties" : {
                "id" : { "type" : "long"},
                "productName" : { "type" : "string", "index" : "analyzed", "analyzer": "my_analyzer"}
            }
        }
    }
}

Answer 1

The output of a tokenizer implementation takes the form of the values of the attributes it adds. For example, your tokenizer implementation has:

protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);

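That is the only attribute your code uses, but Elasticsearch expects more than the attribute that represents the token text: it also expects start_offset, end_offset, and position. By adding an OffsetAttribute and setting its values, you can set the start and end offsets of each token correctly: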

https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html
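To illustrate the bookkeeping involved, here is a minimal sketch in plain Java, deliberately without the Lucene dependency. The class `OffsetSketch`, its `Token` type, and the whitespace splitting rule are all hypothetical and are not taken from the tokenizer in the question. The sketch only shows which two numbers need to be tracked per token; the comment marks where a real Tokenizer would hand them to OffsetAttribute.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    /** A token together with the character offsets of its text in the input. */
    static final class Token {
        final String term;
        final int startOffset; // index of the token's first character
        final int endOffset;   // index one past the token's last character

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace (a hypothetical rule) and records offsets per token. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i == input.length()) {
                break;
            }
            int start = i;
            // Consume the token text.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            // In a Lucene Tokenizer, this is where the term would be copied into
            // CharTermAttribute and the offsets passed on with
            // OffsetAttribute.setOffset(correctOffset(start), correctOffset(i)).
            tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("red  running shoes")) {
            System.out.println(t.term
                    + " start_offset=" + t.startOffset
                    + " end_offset=" + t.endOffset);
        }
    }
}
```

If the offsets are never set, they stay at their default of 0, which matches what the question describes.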

Similarly, PositionIncrementAttribute is used to set the position:

https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html

Its contract is described in the Javadoc. Note that 0 is a valid value, used for example when a word has multiple stems.
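As a rough illustration of how increments relate to the positions that end up in the output, the following plain Java sketch accumulates them. The class `PositionSketch` and the `firstPosition` parameter are hypothetical; counting from 1 is an assumption chosen to match the output shown in the question.

```java
import java.util.Arrays;

class PositionSketch {
    /**
     * Turns per-token position increments into absolute positions.
     * firstPosition is the position given to a first token whose increment is 1
     * (an assumption; the output in the question counts from 1).
     */
    static int[] toPositions(int[] increments, int firstPosition) {
        int[] positions = new int[increments.length];
        int current = firstPosition - 1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i]; // an increment of 0 keeps the previous position
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "running", then a second stem "run" stacked at the same position, then "shoes".
        int[] stacked = toPositions(new int[] {1, 0, 1}, 1);
        System.out.println(Arrays.toString(stacked)); // [1, 1, 2]
    }
}
```

With an increment of 1 for every token the positions come out as 1, 2, 3, and any token emitted with an increment of 0 shares the position of the token before it.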

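For some inspiration, you can look at the standard tokenizer implementation, which uses all three kinds of attributes (as well as the token type attribute):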

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java

Why are position, end_offset, and start_offset messed up when using a home-made Tokenizer?
