Question

I am trying to implement partial substring search in Elasticsearch 7.1 using the following analyzer:

PUT my_index-001

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "autocomplete"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 40
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

After that, I tried adding some sample data to my_index-001 with the type doc:

    PUT my_index-001/doc/1
    {
      "title": "ABBOT Series LTD 2014"
    }
 
    PUT my_index-001/doc/2
    {
      "title": "ABBOT PLO LTD 2014A"
    }
   
    PUT my_index-001/doc/3
    {
      "title": "ABBOT TXT"
    }
    PUT my_index-001/doc/4
    {
      "title": "ABBOT DMO LTD. 2016-II"
    }

The query I use to perform the partial search:

GET my_index-001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ABB",
        "operator": "or"
      }
    }
  }
}

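I expect the following output from the analyzer:

If I enter ABB, I should get doc IDs 1, 2, 3, 4
If I enter ABB 2014, I should get doc IDs 1, 2
If I enter ABBO PLO, I should get doc 2
If I enter TXT, I should get doc 3

With the analyzer settings above, I am not getting the expected results. Please let me know if I am missing anything in my Elasticsearch analyzer settings.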

Answer 1

You're almost there, but there are a few issues.

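When creating an index mapping through Kibana Dev Tools, there must not be any whitespace (such as a blank line) between the URI and the request body. Your first code snippet has one, which causes ES to ignore the request body altogether! So remove that whitespace.
The maximum ngram difference is set to 1 by default. Your filter spans max_gram - min_gram = 40 - 2 = 38, so to use such a wide ngram range you need to explicitly increase the index-level setting max_ngram_diff: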

PUT my_index-001
{
  "settings": {
    "index": {
      "max_ngram_diff": 40   <--
    },
    ...
  }
}
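As a rough illustration of why the default is too small here, the check can be sketched in a few lines of Python. This is not Elasticsearch code: the function names are hypothetical, and the rule that max_gram minus min_gram must not exceed max_ngram_diff is the assumption being modeled.

```python
def required_ngram_diff(min_gram, max_gram):
    """Smallest max_ngram_diff that permits this min_gram/max_gram pair."""
    return max_gram - min_gram


def is_allowed(min_gram, max_gram, max_ngram_diff=1):
    """Assumed rule: the gram range may not be wider than max_ngram_diff.

    The default of 1 mirrors the default index setting mentioned above.
    """
    return required_ngram_diff(min_gram, max_gram) <= max_ngram_diff


if __name__ == "__main__":
    print(required_ngram_diff(2, 40))               # width of the filter's range
    print(is_allowed(2, 40))                        # with the default setting
    print(is_allowed(2, 40, max_ngram_diff=40))     # with the setting raised to 40
```

Under this reading, any value of 38 or higher would accept the filter; the 40 used in this answer simply leaves a little headroom.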

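Type names are deprecated in v7. So is the nGram token filter name, in favor of ngram (lowercase g). The string field type has been replaced by text as well! Here is the corrected PUT request: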

PUT my_index-001                  <--- no whitespace after the URI!
{
  "settings": {
    "index": {
      "max_ngram_diff": 40        <--- explicit setting
    },
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "autocomplete"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "type": "ngram",         <--- ngram, not nGram
          "min_gram": 2,
          "max_gram": 40
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",            <--- text, not string
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

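Since distinct mapping types have been deprecated in favor of the generic _doc type, you need to adjust the way you insert documents. Fortunately, the only difference is changing doc to _doc in the URI: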

PUT my_index-001/_doc/1
{ "title": "ABBOT Series LTD 2014" }
 
PUT my_index-001/_doc/2
{ "title": "ABBOT PLO LTD 2014A" }
   
PUT my_index-001/_doc/3
{ "title": "ABBOT TXT" } 

PUT my_index-001/_doc/4
{ "title": "ABBOT DMO LTD. 2016-II" }

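Finally, your query is perfectly fine and should behave the way you expect. The only thing to change is to set operator to and when querying for two or more substrings, i.e.: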

GET my_index-001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ABB 2014",
        "operator": "and"
      }
    }
  }
}

Apart from that, all four test scenarios should return the results you expect.
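To sanity-check the four scenarios without a cluster, the analysis chain can be approximated in plain Python. This is only a sketch under stated assumptions: str.split() stands in for the whitespace tokenizer, str.lower() for the lowercase filter, and the ngrams helper for the ngram filter with min_gram 2 and max_gram 40. All function names are hypothetical, and relevance scoring is ignored entirely.

```python
def ngrams(token, min_gram=2, max_gram=40):
    """All substrings of token whose length lies in [min_gram, max_gram]."""
    longest = min(max_gram, len(token))
    return {
        token[i:i + n]
        for n in range(min_gram, longest + 1)
        for i in range(len(token) - n + 1)
    }


def index_terms(text):
    """Approximates the 'autocomplete' analyzer: split, lowercase, ngram."""
    terms = set()
    for token in text.lower().split():
        terms |= ngrams(token)
    return terms


def search_terms(text):
    """Approximates the 'autocomplete_search' analyzer: split, lowercase."""
    return text.lower().split()


DOCS = {
    1: "ABBOT Series LTD 2014",
    2: "ABBOT PLO LTD 2014A",
    3: "ABBOT TXT",
    4: "ABBOT DMO LTD. 2016-II",
}
INDEX = {doc_id: index_terms(title) for doc_id, title in DOCS.items()}


def match(query, operator="or"):
    """Doc ids whose indexed terms contain any ('or') or all ('and') query terms."""
    terms = search_terms(query)
    combine = all if operator == "and" else any
    return sorted(
        doc_id
        for doc_id, indexed in INDEX.items()
        if combine(term in indexed for term in terms)
    )


if __name__ == "__main__":
    print(match("ABB"))
    print(match("ABB 2014", operator="and"))
    print(match("ABB 2014", operator="or"))
    print(match("ABBO PLO", operator="and"))
    print(match("TXT"))
```

In this model, "ABB 2014" with operator or matches every document, because abb alone is a substring of ABBOT in all four titles; only operator and narrows the result to documents 1 and 2.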
