Elasticsearch: combining two words into one

Date: 2019-06-04 12:47:37

Tags: elasticsearch

I have a field, ManufacturerName:

"ManufacturerName": {
    "type": "keyword",
    "normalizer" : "keyword_lowercase"
},

And a normalizer:

"normalizer": {
    "keyword_lowercase": {
       "type": "custom",
       "filter": ["lowercase"]
    }
}

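A search for "ripcurl" matches, but a search for "rip curl" does not.

How, and with what, can I join certain words together, i.e. 'rip curl' -> 'ripcurl'?

Apologies if this is a duplicate; I have already spent some time looking for a solution.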

1 answer:

Answer 0 (score: 1)

You want a text field for this, analyzed with the Ngram Tokenizer.

Below are a sample mapping, sample documents, queries and responses:

Mapping:

PUT mysomeindex
{
  "mappings": {
    "mydocs":{
      "properties": { 
        "ManufacturerName":{
          "type": "text",
          "analyzer": "my_analyzer", 
          "fields":{
            "keyword":{
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }, 
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer":{
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [ "synonyms" ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "filter": {
        "synonyms":{
          "type": "synonym",
          "synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
        }
      }
    }
  }
}

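Note that ManufacturerName is a multi-field: it has the text type plus a sibling keyword sub-field. For exact-match and aggregation queries you can use the keyword sub-field (ManufacturerName.keyword), and for this requirement you can use the text field.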
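To make the difference between the two sub-fields concrete, here is a rough Python sketch of what my_normalizer does to a value before it is stored in the keyword sub-field. The function name normalize_keyword is made up for illustration, and the accent stripping only approximates the asciifolding filter (the real filter handles more characters). It is an assumption-laden sketch, not Elasticsearch's implementation.

```python
import unicodedata


def normalize_keyword(value):
    """Approximate the 'lowercase' + 'asciifolding' normalizer (illustration only)."""
    lowered = value.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The keyword sub-field keeps the whole value as a single term,
# so "rip curl" and "ripcurl" remain two different terms.
print(normalize_keyword("Rip Curl"))
print(normalize_keyword("Rip Curl") == normalize_keyword("ripcurl"))
```

This is why the original keyword-only mapping could not match "rip curl" against "ripcurl": the space is still part of the single stored term.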

Sample documents:

POST mysomeindex/mydocs/1
{
  "ManufacturerName": "ripcurl"
}

POST mysomeindex/mydocs/2
{
  "ManufacturerName": "henri lloyd"
}

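When you ingest the documents above, Elasticsearch breaks each value into tokens of length 3 to 5 (the min_gram and max_gram values in the mapping) and stores them in the inverted index, e.g. rip, ipc, pcu and so on.

You can run the following request to see which tokens are created: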

POST mysomeindex/_analyze
{
  "text": "ripcurl",
  "analyzer": "my_analyzer"
}
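If you do not have a cluster at hand, the following Python sketch approximates what the tokenizer produces. It assumes the settings from the mapping above (min_gram 3, max_gram 5, token_chars letter and digit), ignores the synonym filter, and the helper name ngram_tokens is made up for illustration. The real _analyze response also carries offsets and positions, which are omitted here.

```python
def ngram_tokens(text, min_gram=3, max_gram=5):
    """Approximate an ngram tokenizer with token_chars = [letter, digit]."""
    tokens = []
    word = []
    # The trailing space acts as a sentinel that flushes the last word.
    for ch in text + " ":
        if ch.isalnum():
            word.append(ch)
            continue
        # Any other character ends the current word, as token_chars does.
        chunk = "".join(word)
        word = []
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    tokens.append(chunk[start:start + size])
    return tokens


print(ngram_tokens("ripcurl"))
# ['rip', 'ripc', 'ripcu', 'ipc', 'ipcu', 'ipcur', 'pcu', 'pcur', 'pcurl', 'cur', 'curl', 'url']
print(ngram_tokens("rip curl"))
# ['rip', 'cur', 'curl', 'url']
```

Every token produced for "rip curl" also appears in the list for "ripcurl", which is what lets the match query further down find the document.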

I would also suggest looking into the Edge Ngram tokenizer to see whether it suits your requirements better.
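As a rough comparison, an edge n-gram tokenizer only emits grams anchored at the start of each word. The sketch below assumes the same min_gram, max_gram and token_chars values as the mapping above; edge_ngram_tokens is a hypothetical helper, not an Elasticsearch API.

```python
import re


def edge_ngram_tokens(text, min_gram=3, max_gram=5):
    """Approximate an edge_ngram tokenizer with token_chars = [letter, digit]."""
    tokens = []
    # [^\W_]+ matches runs of letters and digits only.
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])  # always starts at the word's first character
    return tokens


print(edge_ngram_tokens("ripcurl"))   # ['rip', 'ripc', 'ripcu']
print(edge_ngram_tokens("rip curl"))  # ['rip', 'cur', 'curl']
```

With edge n-grams, only "rip" is shared between the two inputs, because "cur" and "curl" are not prefixes of "ripcurl". You get a smaller index, but a weaker match for the 'rip curl' -> 'ripcurl' case, so it is worth testing both against your data.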

Query:

POST mysomeindex/_search
{
  "query": {
    "match": {
      "ManufacturerName": "rip curl"
    }
  }
}

Response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.25316024,
    "hits": [
      {
        "_index": "mysomeindex",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.25316024,
        "_source": {
          "ManufacturerName": "ripcurl"
        }
      }
    ]
  }
}
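The reason document 1 comes back is that the query text goes through the same analyzer as the field, and a match query (with its default OR operator) hits any document that shares at least one token with the query. A minimal Python sketch of that overlap, assuming the n-gram settings from the mapping and ignoring the synonym filter and scoring; ngrams and docs are illustrative names only:

```python
import re


def ngrams(text, min_gram=3, max_gram=5):
    """Return the set of 3-5 character grams for each letter/digit run."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word) - min_gram + 1):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.add(word[start:start + size])
    return grams


# The two sample documents indexed above.
docs = {"1": "ripcurl", "2": "henri lloyd"}
query = ngrams("rip curl")

# Tokens shared between the query and each document.
matches = {doc_id: sorted(query & ngrams(name)) for doc_id, name in docs.items()}
for doc_id, shared in matches.items():
    print(doc_id, shared)
# 1 ['cur', 'curl', 'rip', 'url']
# 2 []
```

Document 2 shares no tokens with the query, so only document 1 is returned.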

Query for a synonym:

POST mysomeindex/_search
{
  "query": {
    "match": {
      "ManufacturerName": "henri lloyd"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.2784421,
    "hits": [
      {
        "_index": "mysomeindex",
        "_type": "mydocs",
        "_id": "2",
        "_score": 2.2784421,
        "_source": {
          "ManufacturerName": "henry lloyd"
        }
      }
    ]
  }
}

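Note: if you plan to use synonyms, the best approach is to keep them in a text file and reference that file with a path relative to the config folder (the synonym filter's synonyms_path setting).

Hope this helps!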