Elasticsearch still lowercases tokens even though the custom analyzer explicitly omits the lowercase filter

Time: 2019-09-17 14:24:02

Tags: elasticsearch

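I'm basically trying to disable the lowercase filter so that I can do case-sensitive matching on a text field. Following the index and analyzer documentation, I created the following index with a custom analyzer that has no lowercase filter: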

PUT /my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding"
          ]
        }
      }
    }
  }
}

I enabled fielddata so that the tokenization can be inspected later:

PUT my_index/_mapping/_doc

{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

I tested the custom analyzer to confirm that, as expected, it does not lowercase:

POST /my_index/_analyze

{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà Vu</b>?"
}

and got the following response:

{
  "tokens": [
    {
      "token": "Is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
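The chain that produces these tokens can be sketched in plain Python. This is only a rough approximation for illustration, not Elasticsearch's implementation: the helper names (`html_strip`, `standard_tokenize`, `ascii_fold`) are hypothetical, the tag stripping and tokenizing are simple regular expressions, and the folding step just drops combining accents.

```python
import re
import unicodedata


def html_strip(text):
    """Rough stand-in for the html_strip char filter: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def standard_tokenize(text):
    """Rough stand-in for the standard tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Rough stand-in for asciifolding: decompose characters and drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text):
    """char_filter -> tokenizer -> filter, with no lowercasing step anywhere."""
    return [ascii_fold(tok) for tok in standard_tokenize(html_strip(text))]


tokens = my_custom_analyzer("Is this <b>d\u00e9j\u00e0 Vu</b>?")
print(tokens)
```

Because no step in this chain touches letter case, "Is" and "Vu" keep their capitals, which is consistent with the response above.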

Great, nothing was lowercased, which is what I wanted. So now I index the same text to see what happens:

POST /my_index/_doc

{
  "my_field": "Is this <b>déjà Vu</b>?"
}

and then try to query it:

POST /my_index/_search

{
  "query": {
    "regexp": {
      "my_field": "Is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}

This returns no hits. If I use a lowercase regular expression instead:

POST /my_index/_search

{
  "query": {
    "regexp": {
      "my_field": "is.*"
    }
  },
  "docvalue_fields": [
    "my_field"
  ]
}

it returns:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "6d6PP20BXDCQSINU0RC_",
        "_score": 1,
        "_source": {
          "my_field": "Is this <b>déjà Vu</b>?"
        },
        "fields": {
          "my_field": [
            "b",
            "déjà",
            "is",
            "this",
            "vu"
          ]
        }
      }
    ]
  }
}

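Since only the lowercase regular expression matches and the doc values all come back in lowercase, it looks to me as though lowercasing is still happening somewhere. What am I doing wrong here?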

1 Answer:

Answer 0: (score: 1)

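A good start so far!

The only problem is that you haven't applied your custom analyzer to the field. Without an explicit analyzer in the mapping, a text field is indexed with the default standard analyzer, which lowercases its tokens. Change your mapping to the following and it will get you a step further.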

PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true,
      "analyzer": "my_custom_analyzer"       <-- add this
    }
  }
}
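To see why only the lowercase pattern matched, here is a small Python sketch under stated assumptions: `standard_analyzer` is a hypothetical, simplified stand-in for the default analyzer (word tokenization plus lowercasing, no HTML stripping, no folding), and `regexp_hits` mimics the fact that a regexp query is compared against whole indexed terms rather than the original text. It is an illustration, not how Elasticsearch is implemented.

```python
import re


def standard_analyzer(text):
    """Simplified default analyzer: split into word tokens and lowercase them."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def regexp_hits(pattern, terms):
    """A regexp query must match an entire indexed term, so use fullmatch."""
    return [term for term in terms if re.fullmatch(pattern, term)]


text = "Is this <b>d\u00e9j\u00e0 Vu</b>?"

# Terms indexed when the field has no analyzer set (what the question observed).
default_terms = sorted(set(standard_analyzer(text)))

# Terms taken from the _analyze response for my_custom_analyzer in the question.
custom_terms = ["Is", "this", "deja", "Vu"]

print(default_terms)                       # same terms as in the "fields" output
print(regexp_hits("Is.*", default_terms))  # no term starts with a capital "Is"
print(regexp_hits("is.*", default_terms))  # the lowercased term "is" matches
print(regexp_hits("Is.*", custom_terms))   # case is preserved, so "Is" matches
```

The simulated default terms line up with the "b", "déjà", "is", "this", "vu" values returned in the question, which suggests the field was indexed with the default analyzer. Keep in mind that the analyzer of an existing field generally cannot be changed in place, so in practice this probably means recreating the index with the analyzer set in the mapping and indexing the document again.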