I have an index on which I am trying to force start and end markers onto the text. (Big picture: I am trying to use match_phrase to match whole address phrases, not just sub-phrases.) I have a char_filter that does this job fine on its own, but it seems to cause a problem for the synonym filter. All of this is on Elasticsearch 6.2.14.
Here is the smallest reproduction I can show:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "working_street_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "classic",
          "filter": [
            "street_synonyms"
          ]
        },
        "broken_street_analyzer": {
          "type": "custom",
          "char_filter": [
            "extraTokenAtEnds"
          ],
          "tokenizer": "classic",
          "filter": [
            "street_synonyms"
          ]
        }
      },
      "char_filter": {
        "extraTokenAtEnds": {
          "type": "pattern_replace",
          "pattern": "^(.*)$",
          "replacement": "wordyword $1 wordyword"
        }
      },
      "filter": {
        "street_synonyms": {
          "type": "synonym",
          "synonyms": [
            "south, s",
            "west, w"
          ]
        }
      }
    }
  }
}
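You can see what extraTokenAtEnds does to the raw text, before any tokenization, by calling _analyze with just the char filter and a keyword tokenizer. This is a sketch: it assumes the settings above were used to create an index named my_index.

# Run only the char filter plus a keyword tokenizer to see the rewritten text
GET /my_index/_analyze
{
  "char_filter": ["extraTokenAtEnds"],
  "tokenizer": "keyword",
  "text": "40 s 50 w"
}

This should come back as a single token, wordyword 40 s 50 w wordyword, so the wrapping itself works; the trouble, as it turns out below, is in the offsets attached to the rewritten text.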
Here are two checks against the _analyze endpoint:
{
  "analyzer": "working_street_analyzer",
  "text": "40 s 50 w"
}

{
  "analyzer": "broken_street_analyzer",
  "text": "40 s 50 w"
}
working_street_analyzer gives what you would expect:
{
  "tokens": [
    {
      "token": "40",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "south",
      "start_offset": 3,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "50",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "w",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "west",
      "start_offset": 8,
      "end_offset": 9,
      "type": "SYNONYM",
      "position": 3
    }
  ]
}
broken_street_analyzer skips the synonym step. Running _analyze with "explain": "true" shows that the synonym step does run; it just fails to find any synonyms:
{
  "tokens": [
    {
      "token": "wordyword",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "40",
      "start_offset": 8,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "s",
      "start_offset": 8,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "50",
      "start_offset": 8,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "w",
      "start_offset": 8,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "wordyword",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
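For reference, the "explain": "true" run mentioned above looks like this (my_index is again an assumed index name); its detail output lists the tokens after each stage, showing street_synonyms running but emitting no SYNONYM tokens:

# Same analysis, but with per-stage token details
GET /my_index/_analyze
{
  "analyzer": "broken_street_analyzer",
  "text": "40 s 50 w",
  "explain": true
}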
Answer 0 (score: 1):
It seems that the synonym token filter relies on the offsets of the tokens generated before it, but the pattern_replace char filter breaks those offsets: in the broken_street_analyzer output, the start_offset and end_offset fields of the tokens 40, s, 50, and w all have the same value. This is a well-known issue in Apache Lucene, which is the layer underlying Elasticsearch. The wrong offsets produced by pattern_replace cause other bugs in Elasticsearch as well, for example in result highlighting; you can find a clear explanation of why this happens.
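The answer above diagnoses the cause but does not include a fix. One possible workaround, purely a sketch and not something the answer proposes, is to add the boundary markers outside the analysis chain, in the application before indexing and before querying, so that every token the synonym filter sees carries genuine offsets. This assumes an address field mapped with working_street_analyzer (which has no pattern_replace char filter) on an index named my_index, with the 6.x default _doc type:

# Markers added by the application, not by a char filter
PUT /my_index/_doc/1
{
  "address": "wordyword 40 s 50 w wordyword"
}

# Whole-address phrase match; synonyms expand s/south and w/west as usual
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "address": "wordyword 40 south 50 west wordyword"
    }
  }
}

Because the synonym filter now runs over correctly-offset tokens at both index and query time, s matches south and w matches west, and the wordyword markers still anchor match_phrase to the whole address rather than a sub-phrase.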