How can I combine a pattern_replace char_filter with a synonym filter (the synonyms are being skipped)?

Date: 2018-10-04 23:01:15

Tags: elasticsearch

I have an index on which I'm trying to force begin and end tokens. (Big picture: I'm trying to use match_phrase to match the entire phrase of an address, not just a sub-phrase.) I have a char_filter that accomplishes this and works fine on its own, but it seems to cause problems for the synonym filter. This is all on Elasticsearch 6.2.14.

Here is the smallest example I can come up with that shows the problem:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "working_street_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "classic",
          "filter": [
            "street_synonyms"
          ]
        },
        "broken_street_analyzer": {
          "type": "custom",
          "char_filter": [
            "extraTokenAtEnds"
          ],
          "tokenizer": "classic",
          "filter": [
            "street_synonyms"
          ]
        }
      },
      "char_filter": {
        "extraTokenAtEnds": {
          "type": "pattern_replace",
          "pattern": "^(.*)$",
          "replacement": "wordyword $1 wordyword"
        }
      },
      "filter": {
        "street_synonyms": {
          "type": "synonym",
          "synonyms": [
            "south, s",
            "west, w"
          ]
        }
      }
    }
  }
}
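
For reference, one way to apply these settings is to create a test index (a sketch; the index name street_test is my own placeholder, a local cluster is assumed, and the JSON above is saved as settings.json):

curl -X PUT "localhost:9200/street_test" \
  -H 'Content-Type: application/json' \
  -d @settings.json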

Here are two checks against the _analyze endpoint:

{
   "analyzer": "working_street_analyzer",
   "text":     "40 s 50 w"
}

{
   "analyzer": "broken_street_analyzer",
   "text":     "40 s 50 w"
}
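
Because the custom analyzers are defined in the index settings, both bodies have to be sent to the index-scoped endpoint, e.g. (again assuming the placeholder index name street_test):

GET /street_test/_analyze
{
   "analyzer": "working_street_analyzer",
   "text":     "40 s 50 w"
}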

working_street_analyzer gives you what you would expect:

{
    "tokens": [
        {
            "token": "40",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "s",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "south",
            "start_offset": 3,
            "end_offset": 4,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "50",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "w",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "west",
            "start_offset": 8,
            "end_offset": 9,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

broken_street_analyzer omits the synonym step. Running _analyze with "explain": "true" shows that the synonym step does run; it just doesn't find any synonyms (a sketch of that request follows the output below):

{
    "tokens": [
        {
            "token": "wordyword",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "40",
            "start_offset": 8,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "s",
            "start_offset": 8,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "50",
            "start_offset": 8,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "w",
            "start_offset": 8,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "wordyword",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 5
        }
    ]
}
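
The explain check mentioned above is the same request body plus one flag, which makes _analyze report the token stream after every step of the chain:

GET /street_test/_analyze
{
   "analyzer": "broken_street_analyzer",
   "text":     "40 s 50 w",
   "explain": true
}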

1 answer:

Answer 0 (score: 1):

It looks like the synonym token filter relies on the offsets of the generated tokens, but the pattern_replace char filter breaks those offsets: in the broken_street_analyzer output above, the start_offset and end_offset fields of the tokens 40, s, 50, and w all have the same value.

This is a well-known issue in Apache Lucene, the library underlying Elasticsearch. The incorrect offsets produced by broken_street_analyzer also cause other bugs in Elasticsearch, for example in result highlighting – the issue discussion explains clearly why this happens.
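
One possible workaround (my own suggestion, not part of the answer above) is to add the boundary markers outside the analysis chain, i.e. in the application before indexing and before querying, so the pattern_replace char filter and its broken offsets are never involved. With a hypothetical street field mapped to use an analyzer like working_street_analyzer, it might look like:

PUT /street_test/_doc/1
{
  "street": "wordyword 40 s 50 w wordyword"
}

GET /street_test/_search
{
  "query": {
    "match_phrase": {
      "street": "wordyword 40 s 50 w wordyword"
    }
  }
}

Since the markers now arrive as ordinary whitespace-separated words, the classic tokenizer emits them as regular tokens with correct offsets, and the synonym filter behaves the same way it does in working_street_analyzer.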